JPS61156295A

JPS61156295A - Voice recognition equipment

Info

Publication number: JPS61156295A
Application number: JP59276799A
Authority: JP
Inventors: 進原
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1984-12-28
Filing date: 1984-12-28
Publication date: 1986-07-15

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a speech recognition device capable of recognizing speech automatically.

[Conventional technology]

One conventional speech recognition method is pattern matching. In this method, standard voice patterns are created from the power spectra of a number of reference utterances. Each is compared with an input voice pattern created from the power spectrum of the input speech; that is, pattern matching is performed. A value representing the degree to which the two patterns differ (hereinafter the "dissimilarity") is calculated. The standard voice pattern with the smallest dissimilarity is then judged to represent the same speech as the input.
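The minimum-dissimilarity decision rule described above can be sketched as follows. This is a minimal illustration and not the patent's implementation. The function name `recognize`, the dictionary of labelled standard patterns, and the pluggable `dissimilarity` callable are all hypothetical.

```python
from typing import Callable, Dict, Sequence

# A pattern is a time series of feature vectors (one vector per frame).
Pattern = Sequence[Sequence[float]]


def recognize(
    input_pattern: Pattern,
    standard_patterns: Dict[str, Pattern],
    dissimilarity: Callable[[Pattern, Pattern], float],
) -> str:
    """Return the label of the standard pattern with the smallest dissimilarity.

    Hypothetical helper: the text states only the decision rule
    (choose the minimum-dissimilarity standard pattern), not an API.
    """
    if not standard_patterns:
        raise ValueError("at least one standard pattern is required")
    return min(
        standard_patterns,
        key=lambda label: dissimilarity(input_pattern, standard_patterns[label]),
    )


if __name__ == "__main__":
    # Toy dissimilarity on single-frame, single-channel patterns.
    def toy(x: Pattern, y: Pattern) -> float:
        return abs(x[0][0] - y[0][0])

    print(recognize([[1.0]], {"ma": [[0.8]], "na": [[3.0]]}, toy))  # prints ma
```

Passing the dissimilarity in as a function keeps the decision rule separate from the distance measure. A time-warping measure such as DP matching can then replace a simple frame-by-frame comparison.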

When monosyllables are the recognition target, a monosyllabic standard pattern is a time series of feature vectors. Each vector quantifies the speech power in each frequency band, from the beginning to the end of the utterance. The monosyllabic input pattern is likewise a time series of feature vectors. Let a be the input voice pattern and b the monosyllabic standard pattern. The dissimilarity d is then given by the following equation.

$$d = \sum_{r=1}^{R} \lvert a_r - b_r \rvert$$

where $a = (a_1, a_2, \ldots, a_R)$ and $b = (b_1, b_2, \ldots, b_R)$.

[Problems to be Solved by the Invention]

DP matching is widely used as a highly practical speech recognition method based on pattern matching. However, when monosyllables are the target, certain pairs are easily misrecognized, such as 「ナ」 (na) and 「マ」 (ma), 「ニ」 (ni) and 「ミ」 (mi), and 「タ」 (ta) and 「ハ」 (ha).
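The dissimilarity d defined above and the DP matching mentioned here can be sketched as follows. Several points are assumptions, because the text specifies neither the exact distance nor the DP variant:

- Each term |a_r − b_r| is taken as a city-block distance (sum of absolute differences) between feature vectors.
- DP matching uses a basic dynamic-time-warping recursion with unit step weights.
- All function names are hypothetical.

```python
import math
from typing import Sequence

Frame = Sequence[float]
Pattern = Sequence[Frame]


def frame_distance(x: Frame, y: Frame) -> float:
    """City-block distance between two feature vectors (assumed form of |a_r - b_r|)."""
    return sum(abs(p - q) for p, q in zip(x, y))


def linear_dissimilarity(a: Pattern, b: Pattern) -> float:
    """d = sum over r of |a_r - b_r| for equal-length patterns (no time warping)."""
    if len(a) != len(b):
        raise ValueError("linear matching needs patterns of equal length R")
    return sum(frame_distance(x, y) for x, y in zip(a, b))


def dp_dissimilarity(a: Pattern, b: Pattern) -> float:
    """Basic DTW: minimum accumulated frame distance over monotonic alignments."""
    n_a, n_b = len(a), len(b)
    g = [[math.inf] * (n_b + 1) for _ in range(n_a + 1)]
    g[0][0] = 0.0
    for i in range(1, n_a + 1):
        for j in range(1, n_b + 1):
            cost = frame_distance(a[i - 1], b[j - 1])
            # Allowed predecessors: vertical, diagonal, horizontal steps.
            g[i][j] = cost + min(g[i - 1][j], g[i - 1][j - 1], g[i][j - 1])
    return g[n_a][n_b]


if __name__ == "__main__":
    a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    b = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # same sound, spoken slower
    print(dp_dissimilarity(a, b))  # prints 0.0: the repeated frame is absorbed
```

Linear matching needs patterns of equal length. DP matching tolerates differences in speaking rate by warping the time axis, which is why it is the more practical choice.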

One way to improve recognition accuracy is to capture the movements of the articulators during speech. Japanese Patent Application Laid-Open No. 58-150997 (1983) discloses a method that uses information such as palatal contact. However, obtaining palatal contact information requires the speaker to wear many devices, and attaching and removing them is cumbersome.

Consider how speech is produced. A speaker uses vocal-fold vibration, turbulence in the exhaled air, and similar sources as sound sources, and shapes the sound with articulators such as the palate, tongue, and lips. During speech these articulators continuously open, close, and otherwise change shape. Together they can be regarded as an acoustic resonance system.

Mouth movements, particularly those for plosives, are an important factor in an utterance, so capturing the opening and closing of the mouth is valuable. However, for sounds such as the ma-row plosives, usually no sound is emitted at the moment the lips close, or the sound power is very weak. This makes the closure difficult to capture as information.

Another conceivable approach treats the palate, tongue, lips, and so on as an acoustic resonance system. An external sound wave oscillator radiates sound waves around the speaker's lips, and the echoes returned by this resonance system are captured. These echoes give information on articulator movements such as the opening and closing of the lips. With this approach, however, the characteristics change with the relative positions of the microphone, the oscillator, and the lips. Furthermore, if the microphone and the oscillator are built as a single unit, mechanical vibration of the oscillator is transmitted to the microphone.

[Means for solving problems]

To solve these problems, the present invention provides the following:

- a sound wave oscillator that radiates sound waves toward the speaker's lips and the surrounding area;
- a voice input unit, integrated with the sound wave oscillator, that detects the speaker's utterance together with the echoes produced when the oscillator's sound waves strike the speaker's lips and the surrounding area;
- a voice input switching unit that sets the timing so that the input intervals of the voice signal from the voice input unit do not overlap the oscillation intervals of the sound wave oscillator, and that switches between oscillation and voice signal input;
- a vibration noise storage unit, connected to a voice pattern creation unit, that stores vibration noise while no voice is being input.

[Operation]

In the present invention, when voice is input, the voice signal is corrected using the values stored in the vibration noise storage unit. The input voice pattern and the standard voice pattern are then created from the corrected signal.

[Example]

An embodiment of the speech recognition device according to the present invention is shown in FIG. 1. In FIG. 1:
1 is a sound wave oscillator that radiates sound waves toward the speaker's lips and the surrounding area;
2 is an acoustic resonance system such as the palate;
5 is a microphone serving as the voice input unit; it is integrated with the sound wave oscillator 1 and detects the speaker's utterance 4 together with the echo 3 produced when sound waves from oscillator 1 strike the speaker's lips and the surrounding area;
6 is an analog-to-digital converter that converts the echo 3 and the voice 4 into a digital voice signal;
7 is a voice input switching unit that sets the timing so that the input intervals of the voice signal entering microphone 5 do not overlap the oscillation intervals of oscillator 1, switches between oscillation of oscillator 1 and input of voice 4, and outputs a voice signal a;
8 is a voice pattern creation unit that converts the voice signal a into an input voice pattern and a standard voice pattern, each a time series of feature vectors;
9 is a vibration noise storage unit, connected to the voice pattern creation unit 8, that stores vibration noise while no voice is being input;
10 is a standard pattern registration unit that stores a plurality of standard voice patterns;
11 is an input pattern registration unit that registers the input voice pattern created from the digitized voice signal a;
12 is a pattern transfer unit that transfers the monosyllabic standard patterns registered in the standard pattern registration unit 10 and the input voice pattern registered in the input pattern registration unit 11;
13 is a DP matching unit that matches the input voice pattern transferred from the pattern transfer unit 12 against the monosyllabic standard patterns; and
14 is an output determination unit that selects candidates for the input voice pattern.

Next, the operation of the device configured in this way is described. First, the voice input switching unit 7 sends the oscillation instruction signal S1 shown in FIG. 2(a) to the sound wave oscillator 1.

The voice input switching unit 7 also turns off the gate signal S2 shown in FIG. 2(b) for a time τ1 + τ2, which blocks voice signal input during that time. Here τ1 is the time during which the oscillation instruction signal S1 is on, and τ2 is an additional margin.

The sound wave oscillator 1, placed near the speaker's lips, radiates sound waves toward the acoustic resonance system 2 for the time τ1 during which the oscillation instruction signal S1 is on.
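The timing relationship between S1 and S2 can be illustrated with a small sketch. Several choices are assumptions, since the text says only that the signals repeat at a constant interval:

- Each cycle starts at T0 = 0 and repeats with a fixed period.
- The 9 ms period is borrowed from the frame slot time in the numerical example; the other values come from that example too.
- The helper names are hypothetical.

```python
from typing import List, Tuple


def control_signals(t: float, period: float, tau1: float, tau2: float) -> Tuple[bool, bool]:
    """Return (S1 is on, S2 is on) at time t in seconds.

    S1 (oscillation instruction) is on for tau1 at the start of each cycle.
    S2 (input gate) is off for tau1 + tau2 at the start of each cycle, so
    voice input is blocked while the oscillator runs and for tau2 afterwards.
    """
    phase = t % period
    s1_on = phase < tau1
    s2_on = phase >= tau1 + tau2
    return s1_on, s2_on


def gated_sample_indices(fs: float, n_samples: int, period: float,
                         tau1: float, tau2: float) -> List[int]:
    """Indices of samples discarded because gate S2 is off at their sampling instant."""
    return [n for n in range(n_samples)
            if not control_signals(n / fs, period, tau1, tau2)[1]]


if __name__ == "__main__":
    # 8 kHz sampling, tau1 = 0.2 ms, tau2 = 0.0441 ms, assumed 9 ms cycle.
    print(gated_sample_indices(fs=8000, n_samples=144, period=0.009,
                               tau1=0.0002, tau2=0.0000441))
    # prints [0, 1, 72, 73]
```

With these values, only two samples per cycle fall inside the gated window. The intermittent voice signal a therefore loses very little of the speech.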

τ2 can be determined as follows. Let V be the speed of sound and l the distance between oscillator 1 and microphone 5. Suppose oscillator 1 starts emitting at time T0 and stops at time T1, as shown in FIG. 2(b). The sound wave emitted just before oscillation stops reaches microphone 5 at time T1 + l/V. Let T2 be the time at which the gate signal S2 turns back on. If T2 − T1 = τ2 ≥ l/V, sound waves from oscillator 1 are prevented from entering microphone 5 directly.
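The condition can be written out step by step. This restates the reasoning above, taking T2 as the time at which the gate signal S2 turns back on, that is, T2 = T0 + τ1 + τ2:

```latex
\begin{aligned}
T_1 &= T_0 + \tau_1 && \text{(oscillation stops)} \\
T_2 &= T_0 + \tau_1 + \tau_2 = T_1 + \tau_2 && \text{(gate $S_2$ reopens)} \\
t_{\mathrm{arr}} &= T_1 + \frac{l}{V} && \text{(last directly radiated wave reaches microphone 5)} \\
T_2 \ge t_{\mathrm{arr}} &\iff \tau_2 = T_2 - T_1 \ge \frac{l}{V}
\end{aligned}
```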

As an example, take the following values:

- V = 340 m/sec;
- l = 1.5 cm;
- frame slot time of 9 msec;
- voice signal sampling frequency of 8 kHz;
- oscillation frequency of oscillator 1 of 5 to 15 kHz.

Calculating τ1 and τ2, one period of a 5 kHz sound wave is 0.2 msec, and l/V = 0.0441 msec. The voice input switching unit 7 therefore turns the oscillation instruction signal S1 on for 0.2 msec and turns the gate signal S2 off for 0.2441 msec, repeating this at a fixed period.
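The arithmetic in this example can be checked with a short calculation. The function name `gate_timing` is hypothetical. Setting τ1 to one period of the lowest oscillation frequency and τ2 to the minimum value l/V follows the example above.

```python
from typing import Tuple


def gate_timing(speed_of_sound: float, distance: float,
                min_osc_freq: float) -> Tuple[float, float, float]:
    """Return (tau1, tau2, gate_off) in seconds.

    tau1: on-time of S1, one period of the lowest oscillation frequency.
    tau2: extra gate-off time, the smallest value satisfying tau2 >= l / V.
    gate_off: total time S2 stays off, tau1 + tau2.
    """
    tau1 = 1.0 / min_osc_freq
    tau2 = distance / speed_of_sound
    return tau1, tau2, tau1 + tau2


if __name__ == "__main__":
    tau1, tau2, gate_off = gate_timing(340.0, 0.015, 5000.0)
    print(f"tau1 = {tau1 * 1e3:.4f} ms, tau2 = {tau2 * 1e3:.4f} ms, "
          f"S2 off = {gate_off * 1e3:.4f} ms")
    # prints tau1 = 0.2000 ms, tau2 = 0.0441 ms, S2 off = 0.2441 ms
```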

In this way, the echo 3 from the acoustic resonance system 2 (the palate and so on) and the speaker's voice 4 enter microphone 5. The analog-to-digital converter 6 converts them into a digital voice signal, which passes through the voice input switching unit 7 to the voice pattern creation unit 8 as an intermittent voice signal a. From this voice signal a, the voice pattern creation unit 8 creates the voice pattern shown in FIG. 3. This voice pattern is an example of a time-series arrangement with lip opening/closing information added. The band-by-band voice patterns 15, 16, 17, and 18 are registered in time order in frames 19, 20, and 21.
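The sketch below illustrates how a frame-by-frame, band-by-band pattern like the one in FIG. 3 might be assembled. It splits a digitized signal into fixed-length frames and sums DFT power within frequency bands. The naive DFT, the frame length, and the band edges are assumptions for illustration, as are the function names. The text does not specify the feature extraction or how the lip opening/closing information is encoded, so that part is not modelled.

```python
import cmath
import math
from typing import List, Sequence, Tuple


def band_powers(frame: Sequence[float], fs: float,
                band_edges: Sequence[Tuple[float, float]]) -> List[float]:
    """Sum of |X[k]|^2 over DFT bins whose frequency lies in each [lo, hi) band."""
    n = len(frame)
    spectrum = [
        sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))
        for k in range(n // 2 + 1)  # non-negative frequencies only
    ]
    return [
        sum(abs(spectrum[k]) ** 2 for k in range(len(spectrum))
            if lo <= k * fs / n < hi)
        for lo, hi in band_edges
    ]


def make_pattern(signal: Sequence[float], fs: float, frame_len: int,
                 band_edges: Sequence[Tuple[float, float]]) -> List[List[float]]:
    """Time series of per-frame band-power vectors (one row per frame)."""
    return [band_powers(signal[s:s + frame_len], fs, band_edges)
            for s in range(0, len(signal) - frame_len + 1, frame_len)]


if __name__ == "__main__":
    fs = 8000.0
    tone = [math.cos(2 * math.pi * 1000.0 * i / fs) for i in range(128)]
    bands = [(0.0, 500.0), (500.0, 1500.0), (1500.0, 4000.0)]
    for row in make_pattern(tone, fs, 64, bands):
        print([round(p, 3) for p in row])  # energy concentrated in the middle band
```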

As shown in FIG. 4, the sound wave oscillator 1 and the microphone 5 are coupled to a support rod 22 by struts 23 and 24, respectively. Because of this construction, any mechanical vibration that oscillator 1 produces while generating sound waves enters microphone 5 as vibration noise. This happens for as long as oscillator 1 operates, regardless of whether the speaker is talking. When no speaker is present, only the vibration noise from the decaying mechanical vibration reaches the voice pattern creation unit 8. At such times, the vibration noise storage unit 9 connected to unit 8 stores the per-frame vibration noise component for each channel.
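Storing a per-channel noise value can be sketched as follows. Averaging over several noise-only frames is an assumption; the text says only that the per-frame noise component is stored for each channel. The function name is hypothetical.

```python
from typing import List, Sequence


def estimate_vibration_noise(noise_frames: Sequence[Sequence[float]]) -> List[float]:
    """Per-channel average of band-power frames recorded with no speaker present.

    Averaging over frames is an assumption made for this sketch.
    """
    if not noise_frames:
        raise ValueError("need at least one noise-only frame")
    n_frames = len(noise_frames)
    n_channels = len(noise_frames[0])
    return [sum(frame[c] for frame in noise_frames) / n_frames
            for c in range(n_channels)]


if __name__ == "__main__":
    print(estimate_vibration_noise([[1.0, 2.0, 0.5], [3.0, 4.0, 1.5]]))
    # prints [2.0, 3.0, 1.0]
```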

When the speaker talks, the voice power contains both the speaker's voice 4 and the echo 3 originating from oscillator 1. The value registered in the voice pattern (FIG. 3) created by the voice pattern creation unit 8 is the band-by-band component ratio of this voice power minus the value stored in the vibration noise storage unit 9.
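The subtraction can be sketched as follows, applying the stored per-channel noise to every frame of a band-power pattern. Clamping negative results to zero is an assumption not stated in the text, and the function name is hypothetical.

```python
from typing import List, Sequence


def subtract_vibration_noise(pattern: Sequence[Sequence[float]],
                             noise: Sequence[float]) -> List[List[float]]:
    """Subtract the stored per-channel noise from every frame of a pattern.

    Clamping negative results to zero is an assumption added for this sketch.
    """
    return [[max(0.0, value - n) for value, n in zip(frame, noise)]
            for frame in pattern]


if __name__ == "__main__":
    print(subtract_vibration_noise([[5.0, 1.0], [2.5, 4.0]], [2.0, 3.0]))
    # prints [[3.0, 0.0], [0.5, 1.0]]
```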

The voice pattern creation unit 8 operates in two modes: a monosyllabic standard pattern registration mode and an input pattern registration mode.

- In the standard pattern registration mode, it creates a monosyllabic standard pattern from the digitized voice signal a and the vibration noise stored in the vibration noise storage unit 9. It registers the result in the standard pattern registration unit 10.
- In the input pattern registration mode, it creates an input voice pattern from the voice signal a and the vibration noise stored in unit 9. It registers the result in the input pattern registration unit 11.
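The two registration modes can be sketched as a simple dispatch. `Mode` and `PatternRegistry` are hypothetical stand-ins for the mode switch and for registration units 10 and 11. The pattern passed in is assumed to be noise-corrected already.

```python
from enum import Enum
from typing import Dict, List, Optional, Sequence


class Mode(Enum):
    STANDARD_REGISTRATION = "monosyllabic standard pattern registration"
    INPUT_REGISTRATION = "input pattern registration"


class PatternRegistry:
    """Hypothetical stand-in for registration units 10 (standard) and 11 (input)."""

    def __init__(self) -> None:
        self.standard: Dict[str, List[List[float]]] = {}  # unit 10: label -> pattern
        self.input_pattern: Optional[List[List[float]]] = None  # unit 11

    def register(self, mode: Mode, pattern: Sequence[Sequence[float]],
                 label: Optional[str] = None) -> None:
        """Store a noise-corrected pattern according to the current mode."""
        rows = [list(frame) for frame in pattern]
        if mode is Mode.STANDARD_REGISTRATION:
            if label is None:
                raise ValueError("a standard pattern needs a monosyllable label")
            self.standard[label] = rows
        else:
            self.input_pattern = rows


if __name__ == "__main__":
    registry = PatternRegistry()
    registry.register(Mode.STANDARD_REGISTRATION, [[1.0, 0.2]], label="ma")
    registry.register(Mode.INPUT_REGISTRATION, [[0.9, 0.3]])
    print(registry.standard, registry.input_pattern)
```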

The pattern transfer unit 12 transfers the monosyllabic standard patterns registered in the standard pattern registration unit 10, and the input voice pattern registered in the input pattern registration unit 11, to the DP matching unit 13. The DP matching unit 13 matches the input voice pattern against every monosyllabic standard pattern and calculates each dissimilarity. It then sorts the monosyllabic standard patterns in ascending order of dissimilarity and passes them, with their dissimilarities, to the output determination unit 14. The output determination unit 14 selects the first m patterns of this list, that is, the m patterns with the smallest dissimilarities, as candidates for the input voice pattern. However, any of these m candidates whose dissimilarity exceeds a fixed value Re is excluded.
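The ranking and candidate selection can be sketched as follows. The function name and the dictionary input are hypothetical. The logic follows the text: sort by ascending dissimilarity, keep the first m, and drop any whose dissimilarity exceeds Re.

```python
from typing import Dict, List, Tuple


def select_candidates(dissimilarities: Dict[str, float], m: int,
                      re_threshold: float) -> List[Tuple[str, float]]:
    """Return up to m (label, dissimilarity) pairs, smallest first, none above Re."""
    ranked = sorted(dissimilarities.items(), key=lambda item: item[1])  # ascending
    return [(label, d) for label, d in ranked[:m] if d <= re_threshold]


if __name__ == "__main__":
    scores = {"ma": 1.0, "na": 2.0, "ta": 5.0, "ha": 0.5}
    print(select_candidates(scores, m=3, re_threshold=1.5))
    # prints [('ha', 0.5), ('ma', 1.0)]
```

The threshold is applied after the top-m cut, as in the text. The result can therefore contain fewer than m candidates, or none at all.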

In this way, sound from the sound wave oscillator 1 placed near the lips is radiated toward the acoustic resonance system 2 (the lips, oral cavity, and so on), and the echo 3 can be captured. This makes it possible to capture movements of the articulators that are closely tied to speech, including at moments when the speaker is not vocalizing.

Furthermore, integrating the sound wave oscillator 1 with the microphone 5 makes the characteristics stable and reproducible.

In addition, voice patterns with the vibration noise components removed can be created, both for monosyllabic standard patterns and for input voice patterns.

The sound waves radiated by oscillator 1 may, for example, use a high-frequency range that avoids the speech frequency range.

[Effect of the Invention]

As described above, the present invention does three things:

- it stores in advance the vibration noise caused by mechanical vibration as the sound wave oscillator, fixed to the voice input unit, generates sound waves;
- it blocks the voice signal that would arise if sound waves from the oscillator entered the voice input unit directly;
- it corrects the voice signal produced by the speaker's voice using the stored vibration noise.

Echoes from articulators such as the palate are therefore captured reliably. Lip opening/closing information derived from them can be added to the monosyllabic standard patterns and monosyllabic input patterns, which improves recognition accuracy.

Moreover, information on the opening and closing of the lips can be obtained without any equipment worn by the speaker. Furthermore, the relative positions of the sound wave oscillator and the microphone are fixed. Recognition is therefore stable and highly accurate every time, even when speech recognition is performed repeatedly.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of the speech recognition device according to the present invention. FIG. 2 is a timing chart explaining its operation. FIG. 3 is a voice pattern diagram showing how speech is registered. FIG. 4 is a configuration diagram showing how the sound wave oscillator and the microphone are coupled.

1: sound wave oscillator
2: acoustic resonance system
5: microphone
6: analog-to-digital converter
7: voice input switching unit
8: voice pattern creation unit
9: vibration noise storage unit
10: standard pattern registration unit
11: input pattern registration unit
12: pattern transfer unit
13: DP matching unit
14: output determination unit
22: support rod
23, 24: struts

Claims

A speech recognition device that identifies input speech by pattern matching between an input voice pattern and standard voice patterns, the device comprising:

- a voice pattern creation unit that converts a voice signal into an input voice pattern and a standard voice pattern, each a time series of feature vectors;
- a standard pattern registration unit that stores a plurality of the standard voice patterns;
- a sound wave oscillator that radiates sound waves toward the speaker's lips and the surrounding area;
- a voice input unit, integrated with the sound wave oscillator, that detects the speaker's utterance together with the echoes produced when the oscillator's sound waves strike the speaker's lips and the surrounding area;
- a voice input switching unit that sets the timing so that the input intervals of the voice signal from the voice input unit do not overlap the oscillation intervals of the sound wave oscillator, and that switches between oscillation and voice signal input;
- a vibration noise storage unit, connected to the voice pattern creation unit, that stores vibration noise while no voice is being input;

wherein, when voice is input, the voice signal is corrected using the values stored in the vibration noise storage unit to create the input voice pattern and the standard voice pattern.