JP2007292814A

JP2007292814A - Voice recognition apparatus

Info

Publication number: JP2007292814A
Application number: JP2006117300A
Authority: JP
Inventors: Toru Marumoto; 徹丸本
Original assignee: Alpine Electronics Inc
Current assignee: Alpine Electronics Inc
Priority date: 2006-04-20
Filing date: 2006-04-20
Publication date: 2007-11-08
Anticipated expiration: 2026-04-20
Also published as: JP5115944B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition apparatus capable of removing the audio sound that conventional audio cancellation systems leave behind, and capable of increasing the processing speed.

SOLUTION: In a subtraction unit 4, an error signal is obtained by subtracting the output of an adaptive filter 5 from the mixed signal of the user's voice and audio sound picked up by a microphone 3. The adaptive filter 5 is controlled by an LMS algorithm using a step-size parameter (μ) adjusted by a step-size parameter adjusting section 7. In a system in which this error signal is input to a voice recognition engine, μ is gradually decreased starting a predetermined period after the voice input switch is pressed, and is held at its lowest value during the utterance period. The voice intensity of the error signal is calculated, and once a predetermined period has elapsed after the end of the period in which the voice intensity exceeds a predetermined threshold, μ is gradually increased back to its original value. The predetermined periods and the threshold, which serve as references for determining the presence of an utterance, are selected according to personal information and can be changed gradually by learning from the voice recognition results.

COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a speech recognition apparatus that raises the speech recognition rate by cancelling audio sound picked up by the microphone.

In recent years, instructing devices by voice, with a speech recognition apparatus recognizing the instruction and controlling the device accordingly, has become common in many fields, from personal computers to general household appliances, and research and development is advancing rapidly. One such field attracting attention is the voice operation of in-vehicle devices. Most in-vehicle devices are operated by the driver, yet for safe driving the driver's attention should be diverted to operating them as little as possible.

In-vehicle devices have recently come to require more and more operating instructions as audio systems become more sophisticated and navigation functions diversify. As a countermeasure, systems using the above speech recognition apparatus have been put into practical use: the driver, keeping his or her eyes on the road ahead, instructs the navigation device by voice to search for nearby facilities, for example, and the navigation device displays the search results on the screen and also responds by voice.

However, when a speech recognition apparatus is installed in a vehicle to control in-vehicle devices in this way, the cabin contains a mixture of engine noise, tire noise, wind noise, audio sound, and the voices of other occupants, and recognizing an operating instruction from words spoken into a microphone amid such noise is extremely difficult. Within the widely researched field of speech recognition, recognition of operating instructions for in-vehicle devices can therefore be called one of the most difficult areas. To perform speech recognition in such a noisy environment, the noise components mixed into the microphone input must be removed so that, as far as possible, only the user's voice is input.

Techniques studied for this purpose pass noise and voice through an adaptive filter and apply various processing to obtain speech with the desired characteristics. Control by an adaptive filter is itself well known. For example, as shown in FIG. 4, a second signal input x(n) is passed through an FIR (finite impulse response) filter 22 with variable tap coefficients w(n) to obtain an output y(n). This output y(n) and a first signal input d(n), which serves as the target signal, are fed to a subtractor 23 to obtain the error e(n). An adaptive algorithm 24 (for example, LMS) driven by this error e(n) controls the tap coefficients w(n) of the FIR filter 22 so that the power of the error e(n) is brought as close to zero as possible. Various adaptive algorithms have been proposed for such filters, including the learning identification method, the LMS method, the RMS method, and the projection method. With such an adaptive filter, the coefficients are rewritten successively from an arbitrary initial state and gradually approach the tap coefficients w0 that minimize the error.

In an adaptive filter that updates its tap coefficients in real time using, for example, the LMS algorithm, the following update formula is used:

$$w_j(n+1) = w_j(n) + 2\mu(n)\, e(n)\, x_j(n), \qquad j = 0, 1, \ldots, N$$

$$e(n) = d(n) - y(n)$$

Here μ is called the step size parameter and controls how strongly the tap coefficients of the adaptive filter are updated. When μ is large, each correction to the tap coefficients is large, so convergence is fast. However, because the corrections are large, any component that disturbs the coefficient update has a correspondingly strong effect, and the residual error increases. Conversely, when the step size parameter is small, convergence is slow, but disturbing signal components have little influence and the residual error is small.
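The trade-off described here can be illustrated with a small simulation. The following is a minimal sketch, not the patent's implementation: the 4-tap path `w_true`, the uniform random reference, the disturbance level, and the two step sizes are all hypothetical values chosen for illustration. It applies the update formula w_j(n+1) = w_j(n) + 2μ·e(n)·x_j(n) and records the squared tap-coefficient error for a larger and a smaller μ.

```python
import random

def run_lms(mu, n_samples=6000, seed=0):
    """Identify a 4-tap path with LMS; return the squared weight error per sample."""
    rng = random.Random(seed)
    w_true = [0.5, -0.3, 0.2, 0.1]      # hypothetical speaker-to-microphone path
    n_taps = len(w_true)
    w = [0.0] * n_taps                  # adaptive tap coefficients w_j(n)
    x_buf = [0.0] * n_taps              # x_j(n) = x(n - j)
    history = []
    for _ in range(n_samples):
        x_buf = [rng.uniform(-1.0, 1.0)] + x_buf[:-1]
        disturbance = rng.uniform(-0.3, 0.3)   # stands in for speech or road noise
        d = sum(wt * xj for wt, xj in zip(w_true, x_buf)) + disturbance
        y = sum(wj * xj for wj, xj in zip(w, x_buf))
        e = d - y
        # w_j(n+1) = w_j(n) + 2*mu*e(n)*x_j(n)
        w = [wj + 2.0 * mu * e * xj for wj, xj in zip(w, x_buf)]
        history.append(sum((wj - wt) ** 2 for wj, wt in zip(w, w_true)))
    return history

fast = run_lms(mu=0.1)     # larger step size
slow = run_lms(mu=0.005)   # smaller step size

# Weight error shortly after start-up (convergence speed).
early_fast, early_slow = fast[49], slow[49]
# Average weight error near the end of the run (residual error).
late_fast = sum(fast[-2000:]) / 2000
late_slow = sum(slow[-2000:]) / 2000
```

Under these assumptions the larger step size has the smaller error after 50 samples, while the smaller step size ends with the smaller residual error once both have settled.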

On the other hand, when a speech recognition apparatus is used in a vehicle cabin, the loudest sound interfering with recognition is the sound from the audio system, so it is preferable to silence the audio system during speech recognition. However, switching off the audio every time a voice instruction is given is troublesome, and muting is inappropriate when, for example, the instruction is itself directed at the audio system in use, such as a volume change. As a countermeasure, it has been proposed that the speech recognition apparatus cancel the audio sound entering the microphone as follows: the audio signal output to the speaker is taken directly and input to the adaptive filter; the adjusted audio signal output by the adaptive filter and the microphone signal, which contains the audio sound together with the voice, are input to a subtractor; and the adaptive filter is adjusted so that the error is minimized or reaches a predetermined state, so that no audio signal remains in the subtractor output.

As shown in FIG. 5, the basic configuration of such an audio cancellation system includes the adaptive filter configuration of FIG. 4. In this system, the audio output unit that drives a speaker 26 in the vehicle cabin is connected, as a reference signal generator 27, to provide the second input, which is the input signal to the FIR filter 22 whose tap coefficients w(n) are controlled by the LMS algorithm 24. The input to the subtractor 23 corresponds to the signal from a microphone 28 provided in the cabin for the speech recognition apparatus, and this signal reaches the subtractor 23 via a delay circuit 29. The microphone 28 picks up both the speech CsXs from a user 30, which is to be recognized, and the audio sound CnXn, which is noise to be cancelled while the speech recognition apparatus is operating. A system in which, as indicated by the two-dot chain line in FIG. 5, the error signal e(n) from the subtractor 23 is input to the LMS algorithm 24 as in FIG. 4 and is output unchanged to the speech recognition engine 36 corresponds to using the adaptive filter of FIG. 4 as is. The following relations hold, and the update in particular is performed by the adaptive update equation (3):

$$y(n) = \mathbf{w}(n)^{T}\,\mathbf{x}(n) \qquad (1)$$

$$e(n) = d(n) - y(n) \qquad (2)$$

$$\mathbf{w}(n+1) = \mathbf{w}(n) + 2\mu(n)\, e(n)\, \mathbf{x}(n) \qquad (3)$$

$$\mathbf{w}(n) = [\,w(0,n),\ w(1,n),\ \ldots,\ w(N-1,n)\,]^{T} \qquad (4)$$

$$\mathbf{x}(n) = [\,x(n),\ x(n-1),\ \ldots,\ x(n-N+1)\,]^{T} \qquad (5)$$

In such a system, when the user 30 speaks into the microphone 28 to use the speech recognition apparatus while listening to audio from the speaker 26, the microphone 28 also picks up the audio sound, which is particularly loud in the cabin. The voice and other signals from the microphone 28 are input as d(n) to the plus side of the subtractor 23 via the delay circuit 29. Meanwhile, the signal of the audio output unit that drives the speaker 26 is input to the FIR filter 22 as the reference signal x(n); the tap coefficients w(n) of the FIR filter 22 are controlled by the LMS algorithm 24, and the output signal y(n) is obtained.

This output signal y(n) is input to the minus side of the subtractor, which yields the difference between the two, that is, the error e(n) = d(n) − y(n). Ideally, this error e(n) is the microphone signal with the audio sound that travelled from the cabin speaker to the microphone cancelled by the audio signal processed in the adaptive filter. When it is input to the speech recognition engine 36, it is therefore a signal consisting almost entirely of the user's voice, with the cabin audio sound cancelled. Whenever an error remains between the two signals, the error e(n) is fed back to the LMS algorithm 24, which adjusts the tap coefficients w(n) of the FIR filter 22 so as to minimize the power of the error e(n).

As described above, audio sound cancellation (ASC) systems using an adaptive filter are being developed to cancel the audio sound that is mixed into the microphone input when voice is given to the speech recognition apparatus while the audio system is playing through the cabin speakers. Such a system estimates the transfer function from each speaker to the microphone position and synthesizes the audio sound as it appears at the microphone. During speech recognition under music playback, subtracting only this synthesized audio signal from the microphone input leaves just the uttered speech, which makes speech recognition possible while music is playing.

The transfer function in this system is estimated with a Normalized LMS algorithm that uses the audio signal as the reference. FIG. 6 shows an example of the output of such a current system. The step size parameter μ(n) adjusts the degree of coefficient updating. When it is large, tracking speed improves, but the filter becomes susceptible to disturbances such as road noise, conversation, and the uttered speech itself; an echo-laden waveform is output, and speech recognition fails as a result. Tracking performance and robustness to disturbances are thus in a trade-off relationship.
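The text names the Normalized LMS algorithm without giving its update formula. The sketch below uses the commonly cited form, in which the step is divided by the instantaneous energy of the reference buffer so that adaptation does not depend on playback volume. It is an illustration under stated assumptions: the function name `nlms_cancel`, the tap count, the 3-tap `path`, and the signal levels are hypothetical, and the factor 2 of the plain LMS formula is absorbed into `mu`.

```python
import random

def nlms_cancel(reference, mic, n_taps=8, mu=0.5, eps=1e-8):
    """Return the error signal e(n) = d(n) - y(n) after NLMS audio cancellation."""
    w = [0.0] * n_taps
    x_buf = [0.0] * n_taps
    out = []
    for x, d in zip(reference, mic):
        x_buf = [x] + x_buf[:-1]
        y = sum(wj * xj for wj, xj in zip(w, x_buf))
        e = d - y
        power = sum(xj * xj for xj in x_buf)    # instantaneous reference energy
        step = mu / (eps + power)               # normalized step size
        w = [wj + step * e * xj for wj, xj in zip(w, x_buf)]
        out.append(e)
    return out

# Loud "music" reference and a hypothetical speaker-to-microphone impulse response.
rng = random.Random(1)
reference = [rng.uniform(-50.0, 50.0) for _ in range(3000)]
path = [0.6, 0.3, -0.1]
mic = [sum(h * reference[n - k] for k, h in enumerate(path) if n - k >= 0)
       for n in range(len(reference))]

# With no speech and no disturbance, the residual should shrink toward zero.
residual = nlms_cancel(reference, mic)
```

In this disturbance-free setup the residual decays to a negligible level even though the microphone signal remains loud. With speech or road noise added to `mic`, a large `mu` would let that disturbance perturb the coefficients, which is the trade-off the text describes.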

The present applicant therefore proposed and disclosed, in Japanese Patent Laid-Open No. 2001-195085, the provision of a step size parameter changing unit 31 as shown in FIG. 5. Normally the filter adapts with μ set to a relatively large value μ1 that satisfies the stability condition. When the press of the voice input switch 35, which the user presses to start speech recognition, is detected, μ is switched to a value μ2 smaller than μ1 until the speech recognition processing ends or a preset time elapses. In this way, only during the interval in which data is fed to the speech recognition engine 36, the uttered speech is extracted without echo and can be recognized properly. It has also been considered to switch μ to μ2 in response to the press of the voice input switch 35 and then return it to the original value μ1, without waiting for another press of the switch, after a preset time or when the speech recognition response is given.
JP 2001-195085 A

FIG. 6 shows the result of speech recognition processing with the above handling. As the example shows, when the step size parameter is lowered from μ1 to μ2 at the press of the voice input switch (FIG. 6(c)) in order to remove the audio sound as described, and is set to return to μ1 after a preset time or when the speech recognition response is given, the user begins speaking only after a delay from the switch press that is particular to each person, as shown in FIG. 6(b). The utterance then follows. Because the input may be a single word or a relatively long sentence, the utterance signal varies greatly from case to case, and to accommodate this the apparatus waits for the predetermined time regardless of what the user actually says.

When the microphone signal is as shown in FIG. 6(a), the audio removal processing described above yields a processed signal as shown in FIG. 6(b). Speech recognition based on this processed signal achieves a far higher recognition rate than without audio removal, but the rate is still insufficient, and in practice recognition often fails.

Investigation showed the main cause: in the interval between the press of the voice input switch and the start of the utterance, and for a while after the utterance actually ends, uncancelled audio residue remains, as shown in FIG. 6(b), and adversely affects the speech interval detection and pattern matching of the speech recognition engine.

Even when recognition succeeds, the response time is longer than in normal use without music playback, which impairs the usability of the audio sound cancellation (ASC) system. The difference is more pronounced for shorter words, and the response is often delayed by about 2 to 3 seconds. One possible countermeasure is to improve the processing inside the speech recognition engine, but the companies that design and manufacture such engines do not disclose their internals. For anyone applying the engine to, for example, a navigation device, the engine is a black box that cannot be modified, so the problem has to be addressed by other means.

The main object of the present invention is therefore to provide a speech recognition apparatus that, in an audio sound cancellation system using an LMS algorithm with the audio signal as the reference signal, performs speech recognition reliably in an environment where audio sound is input together with the user's speech. The apparatus removes almost all of the audio sound that conventional systems could not remove, which improves the speech recognition rate, and also improves the speed of speech recognition processing.

To solve the above problems, a speech recognition apparatus according to the present invention comprises: a microphone that collects the user's voice to be input to the speech recognition apparatus together with audio sound; an adaptive filter that receives the audio signal from which the audio sound is output and whose tap coefficients are changed by an adaptive algorithm using a step size parameter; and a subtractor that receives the output signal of the adaptive filter and the signal from the microphone, the error signal between the two signals output from the subtractor being input to the adaptive algorithm and also output to a speech recognition engine. The apparatus is characterized by further comprising: voice intensity calculation means for calculating the voice intensity of the error signal; utterance presence determination means for determining a preset predetermined time measured from the point at which the voice intensity calculated by the voice intensity calculation means changes from at or above a preset threshold to below it; and step size parameter adjustment means for gradually increasing the step size parameter, which has previously been decreased, when the utterance presence determination means determines that there is no utterance.

Another speech recognition apparatus according to the present invention is characterized in that, in the above speech recognition apparatus, the step size parameter adjustment means gradually decreases the step size parameter once a preset predetermined period has elapsed after the user presses the voice input switch.

Another speech recognition apparatus according to the present invention is characterized in that, in the above speech recognition apparatus, the voice intensity threshold is changed by learning from the speech recognition results.

Another speech recognition apparatus according to the present invention is characterized in that, in the above speech recognition apparatus, the preset predetermined time measured from the point at which the voice intensity changes from at or above the preset threshold to below it is changed by learning from the speech recognition results.

Another speech recognition apparatus according to the present invention is characterized in that, in the above speech recognition apparatus, the predetermined period after the voice input switch is pressed is changed by learning from the speech recognition results.

Another speech recognition apparatus according to the present invention is characterized in that, in the above speech recognition apparatus, the utterance presence determination means changes its determination criteria according to personal information from personal utterance information setting means, which is set in advance for each user.

Another speech recognition apparatus according to the present invention is characterized in that the above speech recognition apparatus further comprises output control means for outputting zero data to the speech recognition engine when the utterance presence determination means determines that there is no utterance.
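The step size behaviour set out in the paragraphs above (a delay after the switch press, a gradual decrease, a hold during the utterance, and a gradual increase once the intensity has stayed below the threshold for a set time) can be sketched as a small per-frame state machine. This is a minimal illustration, not the patented implementation: the class name `StepSizeScheduler`, the linear ramp, and all numeric defaults are hypothetical, and the learning of thresholds and periods is not modelled.

```python
from dataclasses import dataclass

@dataclass
class StepSizeScheduler:
    mu_high: float = 0.1       # normal step size (mu1), hypothetical value
    mu_low: float = 0.001      # step size during the utterance (mu2), hypothetical
    ramp: float = 0.02         # change in mu per frame while ramping, hypothetical
    wait_frames: int = 5       # delay after the switch press before lowering mu
    hangover_frames: int = 10  # frames below threshold before raising mu again
    threshold: float = 0.01    # voice intensity threshold, hypothetical

    def __post_init__(self):
        self.mu = self.mu_high
        self.armed = False         # True between switch press and end of utterance
        self.since_press = 0
        self.heard_speech = False
        self.below_count = 0

    def press_switch(self):
        """Record that the voice input switch was pressed."""
        self.armed = True
        self.since_press = 0
        self.heard_speech = False
        self.below_count = 0

    def update(self, intensity):
        """Advance one frame and return the step size to use for it."""
        target = self.mu_high
        if self.armed:
            self.since_press += 1
            if intensity >= self.threshold:
                self.heard_speech = True
                self.below_count = 0
            elif self.heard_speech:
                self.below_count += 1
                if self.below_count >= self.hangover_frames:
                    self.armed = False      # utterance judged to have ended
            if self.armed and self.since_press > self.wait_frames:
                target = self.mu_low
        # Move mu toward the target gradually instead of switching it at once.
        if self.mu < target:
            self.mu = min(target, self.mu + self.ramp)
        elif self.mu > target:
            self.mu = max(target, self.mu - self.ramp)
        return self.mu
```

A caller would invoke `press_switch()` when the switch is pressed and then call `update(p)` once per frame with the current voice intensity. The sketch deliberately omits a timeout for the case in which the user never speaks after pressing the switch.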

Because the present invention is configured as described above, when audio sound is cancelled by an audio sound cancellation system that uses the LMS algorithm with the audio signal as the reference signal, in order to perform speech recognition reliably in an environment where audio sound is input together with the user's speech, almost all of the audio sound that conventional systems could not remove is removed. This improves the speech recognition rate and also improves the speed of speech recognition processing.

The present invention achieves the object of removing the audio sound that conventional audio cancellation systems leave behind, and of improving processing speed, in a speech recognition apparatus comprising: a microphone that collects the user's voice to be input to the speech recognition apparatus together with audio sound; an adaptive filter that receives the audio signal from which the audio sound is output and whose tap coefficients are changed by an adaptive algorithm using a step size parameter; and a subtractor that receives the output signal of the adaptive filter and the signal from the microphone, the error signal between the two signals output from the subtractor being input to the adaptive algorithm and also output to a speech recognition engine. The object is achieved by providing: voice intensity calculation means for calculating the voice intensity of the error signal; utterance presence determination means for determining a preset predetermined time measured from the point at which the calculated voice intensity changes from at or above a preset threshold to below it; and step size parameter adjustment means for gradually increasing the step size parameter, which has previously been decreased, when the utterance presence determination means determines that there is no utterance.

An embodiment of the present invention will be described with reference to the drawings. FIG. 1 shows an embodiment of the audio cancellation apparatus for speech recognition according to the present invention. In this embodiment, the in-vehicle audio system produces an audio output x(n), which is sent to a speaker 1 placed in the cabin, and the same signal is also sent to an adaptive filter 5. A microphone 3 picks up the sound from the speaker 1 and, during speech recognition, the utterance of a user 2 as well; the resulting sound signal input to the microphone 3 is h(n).

The adaptive filter 5 operates on the basic principle of FIG. 4 and in the same way as the adaptive filter 22 in the audio cancellation system for a speech recognition apparatus shown in FIG. 5. In a subtractor 4, the output y(n) of the adaptive filter 5, which corresponds to the audio signal, is subtracted from the signal d(n) to be processed, which is derived from the sound signal h(n) input to the microphone with an appropriate delay. This computes e(n) = d(n) − y(n) and yields the error signal e(n). The error signal e(n) obtained in the subtractor 4 is input to an LMS algorithm 6, as in the conventional system. The LMS algorithm 6 of FIG. 1 can be adjusted not only by this error signal and the audio reference signal x(n), but also by the signal from a step size parameter (μ) adjusting unit 7 described later.

In the speech recognition apparatus of FIG. 1, the error signal e(n) from the subtractor 4 is passed through a band-pass filter 8 that extracts the voice band. A voice intensity calculation unit 9 calculates the voice intensity of the resulting data and inputs the value to an utterance presence determination unit 10. Besides the signal from the voice intensity calculation unit 9, the utterance presence determination unit 10 receives the signal from a voice input switch 11 (in particular the signal indicating that the switch has been pressed), the signal from a timer 12, and the user-specific data of a personal utterance information setting unit 18 described later.

Using these data and signals, the utterance presence determination unit 10 determines whether an utterance is present, as described later, and sets the step size parameter to the small value μ2 during the true utterance interval. The utterance/non-utterance signal that the utterance presence determination unit 10 derives from these data and signals is also used by an utterance information learning unit 19, which learns utterance information for the user-specific settings made in the personal utterance information setting unit 18. This utterance/non-utterance signal is further output to an output control unit 15, which operates a changeover switch 14 forming a silence signal generation block 13. On an utterance signal, the output control unit 15 connects the switch to the s1 side and outputs the error signal e(n), which consists mainly of the user's uttered speech, to a speech recognition engine 21. On a non-utterance signal, it connects the switch to the s2 side and outputs zero data so that no sound signal reaches the speech recognition engine.
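The switching between the s1 and s2 sides can be pictured with a very small sketch. The function name `gate_output` and the frame-based interface are assumptions made for illustration; the source describes the behaviour only at block-diagram level.

```python
def gate_output(error_frame, is_utterance):
    """Pass e(n) through during an utterance (s1); emit zero data otherwise (s2)."""
    if is_utterance:
        return list(error_frame)          # s1 side: error signal to the engine
    return [0.0] * len(error_frame)       # s2 side: zero data to the engine

# Example: the same frame routed through each side of the switch.
frame = [0.2, -0.1, 0.05]
during_speech = gate_output(frame, True)
during_silence = gate_output(frame, False)
```

Feeding zeros rather than the residual signal means that any uncancelled audio outside the utterance interval never reaches the recognition engine.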

A personal setting reflection unit 16, which contains the personal utterance information setting unit 18 and the utterance information learning unit 19, includes a personal utterance setting information storage unit 17 that stores the personal utterance setting information; the personal utterance information setting unit 18 uses this information while updating it as appropriate. The personal setting reflection unit 16 also includes a personal identification unit 20, which identifies the individual user from, for example, the owner of a telephone, vehicle information, the selection of a pre-registered user, or biometric information such as a fingerprint or voiceprint, so that the personal utterance information can be set appropriately on that basis as well.

The speech recognition apparatus of the present invention, configured as in the above block diagram, can operate, for example, according to the operation flow shown in FIG. 2. The operation of the invention is described below following the flow of FIG. 2, which mainly illustrates the setting of the step size parameter, the principal function of the invention, with reference to the functional blocks of FIG. 1 and the processing example of FIG. 3. In the example of FIG. 2, processing begins after the audio sound cancellation (ASC) system outputs the current error signal e(i), which corresponds to the error signal e(n) in FIG. 1. First, the band-pass filter 8 of FIG. 1 filters the signal to extract the voice band (step S1). For this extraction, the band is switched according to previously obtained individual information, such as whether the current user is male or female, so that the subsequent processing is performed appropriately.
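As a rough illustration of step S1, the sketch below switches the voice band by user profile and applies a second-order band-pass filter in plain Python. The band edges in `VOICE_BANDS`, the 16 kHz sampling rate, the biquad design, and all function names are assumptions made for this example; the source does not specify the filter order or the bands.

```python
import math

# Hypothetical voice bands in Hz (low edge, high edge); the source gives no values.
VOICE_BANDS = {"male": (100.0, 3000.0), "female": (200.0, 4000.0)}


def bandpass_coeffs(low_hz, high_hz, fs):
    """Second-order band-pass (unity gain at the centre frequency)."""
    f0 = math.sqrt(low_hz * high_hz)   # geometric centre of the band
    q = f0 / (high_hz - low_hz)        # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a


def bandpass(signal, profile, fs=16000.0):
    """Filter the error signal e(i) with the band chosen for this user profile."""
    low, high = VOICE_BANDS[profile]
    (b0, b1, b2), (_, a1, a2) = bandpass_coeffs(low, high, fs)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in signal:
        # Direct-form I difference equation.
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out
```

A real system would likely use a steeper filter; the point here is only that the band is selected from per-user information before filtering.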

Next, the voice intensity calculation unit 9 of FIG. 1 calculates the voice intensity of the filtered signal to obtain the current output signal p(i) (step S2). For example, from the error signal e(n) obtained by the ASC processing shown in FIG. 3(a), a voice intensity information signal such as that shown in FIG. 3(b) is obtained. This signal is used in steps S8 and S9 described below. Next, it is determined whether the voice input switch has been pressed (step S3). The utterance presence/absence determination unit 10 of FIG. 1 makes this determination from the signal of the voice input switch 11. Conventionally, the step size parameter μ was lowered immediately when the voice input switch was pressed, as shown for example in FIG. 3(c). In the present invention, the further processing described below sets μ to an appropriate value matched as closely as possible to the timing of the utterance.
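The source does not say how the voice intensity p(i) of step S2 is computed. A minimal sketch, assuming it is the mean-square power of a short frame of the filtered error signal (the frame length of 160 samples and the function names are hypothetical):

```python
def voice_intensity(frame):
    """Mean-square power of one frame of the filtered error signal."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def intensity_track(signal, frame_len=160):
    """Return one intensity value p(i) per consecutive frame of the signal."""
    return [
        voice_intensity(signal[start:start + frame_len])
        for start in range(0, len(signal), frame_len)
    ]
```

Each returned value would then be compared with the utterance volume threshold p1 in steps S8 and S9.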

When it is determined in step S3 that the voice input switch has been pressed, the variables used in this speech recognition processing are initialized (step S4). At this point the voice input switch is ON, and the utterance flag, which indicates that an utterance is in progress, is still 0. Processing then proceeds to step S5, as it also does when step S3 determines that the voice input switch had already been pressed, and it is determined whether, with the voice input switch ON, the start-count start time has been reached. Because the actual utterance start time after the switch is pressed differs from person to person, the present invention performs a fade-in operation that gradually decreases the step size parameter μ from a point shortly before the utterance starts, so that μ is at its smallest during the actual utterance and abrupt changes are avoided. Step S5 therefore determines whether the start time of the start count for this fade-in operation has been reached.

This determination is made by counting time from the time t1 at which the voice input switch is pressed in FIG. 3(d) and checking whether the time T2 in the figure has elapsed, that is, whether time t2 has been reached. When it is determined that the voice input switch is on and the start-count start time has been reached, the utterance flag is set to 1 and the current state is determined to be fading in (step S6). The start count value at this time can be used for learning the start time. When step S5 determines that it is not the start-count start time and that the start time has already passed, the start count continues to be incremented (step S7).
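One possible reading of steps S4 to S7 is a counter that restarts when the switch is pressed and raises the utterance flag once the delay T2 has elapsed. In the sketch below the class name, the frame-based time unit, and the choice to keep counting after the flag is set are assumptions, not details given in the source.

```python
class StartCounter:
    """Counts frames since the voice input switch was pressed (steps S4 to S7)."""

    def __init__(self, start_delay_frames):
        self.start_delay = start_delay_frames  # T2 expressed in frames
        self.count = 0
        self.utterance_flag = 0
        self.switch_on = False

    def press(self):
        # Step S4: initialise the variables when the switch is pressed.
        self.count = 0
        self.utterance_flag = 0
        self.switch_on = True

    def tick(self):
        """Advance one frame; return True on the frame where fade-in begins."""
        if not self.switch_on:
            return False
        if self.count == self.start_delay:
            # Step S6: start time reached, set the flag and begin fade-in.
            self.utterance_flag = 1
            self.count += 1
            return True
        # Step S7: keep counting.
        self.count += 1
        return False
```

The value of `count` at the moment the user actually starts speaking is the kind of measurement that could feed the start-time learning mentioned above.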

After step S6 or step S7, it is determined whether the utterance flag is 1 and the current utterance volume p(i) is at or below the utterance volume threshold (step S8). A determination that the current utterance volume is not at or below the threshold corresponds, for example, to the voice intensity information serving as the utterance volume in FIG. 3(b) being greater than the threshold p1, so the state is "currently speaking", as shown in step S14. The utterance volume threshold can be learned by using the processing results of the speech recognition engine.

When step S8 determines that the current utterance volume is at or below the threshold, it is then determined whether the previous utterance volume p(i-1) was at or above the threshold (step S9). That is, after step S5 has determined that the voice input switch is on and the start-count start time has been reached, and step S6 has accordingly performed the fade-in of μ and set the utterance flag to 1, it is determined whether the previous utterance volume value p(i-1) was at or above the utterance volume threshold. In other words, this checks whether the utterance volume has remained continuously below the threshold. Whether the utterance volume is at or above the predetermined value is determined by the utterance presence/absence determination unit 10 of FIG. 1, which checks, based on data such as that of FIG. 3(b) obtained by the voice intensity calculation unit 9, whether the current value p(i) and the previous value p(i-1) are at or above the predetermined threshold p1.

When step S8 determines that the current utterance volume is at or below the threshold and step S9 determines that the previous utterance volume was at or above the threshold, that is, at time t5 in FIG. 3(b), the end count is set to 0 (step S10). When step S9 determines that the previous utterance volume was not at or above the threshold, the end count continues (step S11).

After step S10 or step S11, it is determined whether the end count is equal to or greater than a predetermined interval length (step S12). This predetermined interval length is the time T6 in FIG. 3(d) and can be set, for example, to about 1 second. It is chosen to allow for input speech that does not end after a single word. The end count value at this time can be used for learning the interval length.

When step S12 determines that the end count is less than the predetermined interval length, the utterance is regarded as still in progress (step S14), and processing proceeds to step S15. When step S12 determines that the end count has reached the predetermined interval length, step S13 determines that the current state is fade-out. This state applies during the time T7 from time t6 onward in FIG. 3(d).
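The end-of-utterance logic of steps S8 to S13 can be sketched as a hangover counter over the intensity sequence p(i). The function name and the frame-based interval are assumptions; the comparisons follow the text (current value at or below the threshold, previous value at or above it).

```python
def fade_out_start_index(p, threshold, interval_len):
    """Return the index i at which fade-out would begin, or None if it never does.

    p            -- sequence of voice intensity values p(i)
    threshold    -- utterance volume threshold p1
    interval_len -- hangover length T6, in frames
    """
    end_count = None  # no drop below the threshold seen yet
    for i in range(1, len(p)):
        if p[i] > threshold:
            # Step S8 "no": still speaking (step S14), so cancel any count.
            end_count = None
            continue
        if p[i - 1] >= threshold:
            # Step S9 "yes": the volume has just dropped (time t5), step S10.
            end_count = 0
        elif end_count is not None:
            # Step S11: the volume stays low, keep counting.
            end_count += 1
        if end_count is not None and end_count >= interval_len:
            # Step S12 "yes": pause long enough, step S13 (fade-out).
            return i
    return None
```

With `interval_len` chosen to cover about one second, a short pause between words resets the count and does not trigger the fade-out.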

In the example of FIG. 2, after the above operations, the step size parameter μ is processed in step S15 and the subsequent steps. Step S15 determines whether an utterance is currently in progress; if so, μ is set to its minimum value. This corresponds to the period T5 in FIG. 3(d), or to the period including T4 and T6. When step S15 determines that no utterance is in progress, step S17 determines whether fade-out is currently in progress. If the current time falls within the fade-out period, the fade-out processing of μ is performed in step S18. This corresponds to the period T7 in FIG. 3(d). The fade-out is performed gradually by linear interpolation over a predetermined section.

When step S15 determines that no utterance is in progress and step S17 determines that fade-out is not in progress, step S19 determines whether the current time falls within the fade-in period. If it does, the fade-in processing of μ is performed (step S20). This corresponds to the period T3 in FIG. 3(d). As with the fade-out, the fade-in is performed gradually by linear interpolation over a predetermined section. When step S19 determines that fade-in is not in progress either, that is, when the state is none of speaking (step S15), fade-out (step S17), or fade-in (step S19), speech recognition processing is not being performed, so μ is set to a predetermined maximum value (step S21).
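The selection of μ in steps S15 to S21 can be sketched as a function of the current state and of how far the fade has progressed. Note that in this document "fade-in" lowers μ toward its minimum and "fade-out" raises it back to its maximum. The state names, the `progress` argument, and the numeric values in the usage line are assumptions for illustration.

```python
def step_size(state, progress, mu_max, mu_min):
    """Return the step size parameter for the current sample.

    state    -- "speaking", "fade_out", "fade_in", or anything else for idle
    progress -- fraction of the fade section already elapsed, clamped to 0..1
    """
    progress = min(max(progress, 0.0), 1.0)
    if state == "speaking":
        # Step S15 "yes": hold mu at its minimum during the utterance.
        return mu_min
    if state == "fade_out":
        # Step S18: linear interpolation from mu_min back up to mu_max (T7).
        return mu_min + (mu_max - mu_min) * progress
    if state == "fade_in":
        # Step S20: linear interpolation from mu_max down to mu_min (T3).
        return mu_max - (mu_max - mu_min) * progress
    # Step S21: no recognition in progress, use the maximum value.
    return mu_max


# Halfway through a fade-in with hypothetical limits 0.01 and 0.001.
mu_now = step_size("fade_in", 0.5, 0.01, 0.001)
```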

After the above processing, in step S22 the output to the speech recognition engine is set to the current error signal e(i) multiplied by the utterance flag. That is, when the utterance flag is 0 the output is 0, and when the utterance flag is set the error signal resulting from the above processing is output to the speech recognition engine. This processing is performed by the silence signal generation unit 13 in FIG. 1. The filter coefficients are then updated, and the same operation is repeated thereafter.
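The gating of step S22 and the subsequent coefficient update can be sketched as follows. The update shown is the textbook LMS rule w ← w + μ·e·x; the source names the LMS algorithm but does not give its exact form, so the rule, the function names, and the list-based representation are assumptions.

```python
def lms_step(w, x_hist, mic_sample, mu):
    """One LMS iteration of the audio canceller.

    w          -- current tap coefficients of the adaptive filter
    x_hist     -- most recent audio reference samples, newest first
    mic_sample -- microphone sample (user's voice plus audio sound)
    mu         -- step size parameter chosen for this sample
    Returns the error signal e and the updated coefficients.
    """
    y = sum(wi * xi for wi, xi in zip(w, x_hist))   # adaptive filter output
    e = mic_sample - y                              # subtractor output
    w_new = [wi + mu * e * xi for wi, xi in zip(w, x_hist)]
    return e, w_new


def gated_output(e, utterance_flag):
    """Step S22: pass e(i) to the recognizer only while the flag is 1."""
    return e * utterance_flag
```

With a small μ during the utterance, the user's voice in `e` perturbs the coefficients only slightly, which is the reason for lowering μ while speech is present.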

In the above processing, as shown in particular by the personal setting reflection unit 16 of FIG. 1, the utterance information used in each step, namely the various set times and set values, can be learned gradually so that the settings suit each individual. The personal setting reflection unit 16 assumes that the parameters used in the present invention, namely the utterance start time, the utterance interval, and the utterance volume, take roughly the same values for a given individual, and improves performance further by customizing them through learning and applying the resulting set values. Specifically, a parameter a is loaded from a pre-registered personal setting database, using the telephone owner, vehicle information, or biometric information such as a voiceprint as the key. The utterance information learning unit calculates the parameter information t observed when the recognition engine is actually used, and the parameter is updated by the update formula a' = a + k(a - t), where k is a small coefficient that reflects the learning result. When the user changes, the parameter a' is stored in the database and a new parameter b is loaded.
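A minimal sketch of the per-user learning, using the update formula exactly as written in the text. As written, a positive k moves a' away from the observation t and a negative k moves it toward t; the text does not state the sign convention for k, so the small negative default below is an assumption, as are the class, method, and key names.

```python
def update_parameter(a, t, k):
    """Update formula as written in the text: a' = a + k(a - t)."""
    return a + k * (a - t)


class PersonalSettings:
    """Keeps one parameter set per user and swaps sets when the user changes."""

    def __init__(self):
        self.database = {}   # user key -> stored parameters
        self.user = None
        self.params = {}

    def switch_user(self, user, defaults):
        # Store the learned parameters of the previous user, then load the new set.
        if self.user is not None:
            self.database[self.user] = dict(self.params)
        self.user = user
        self.params = dict(self.database.get(user, defaults))

    def learn(self, name, observed, k=-0.05):
        # Nudge one parameter using the value observed during recognition.
        self.params[name] = update_parameter(self.params[name], observed, k)
```

For example, after `switch_user("driver_a", {"start_delay": 1.0})`, a call to `learn("start_delay", 2.0, k=-0.1)` changes the stored delay from 1.0 to about 1.1.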

It is a functional block diagram of an embodiment of the present invention. It is an operation flowchart of the embodiment of the present invention. It is a diagram explaining the signal processing of the present invention. It is a diagram explaining the operation of the adaptive filter used in the present invention. It is a functional block diagram of a conventional speech recognition apparatus. It is a diagram explaining signal processing in the conventional apparatus.

Explanation of symbols

1 Speaker
2 User
3 Microphone
4 Subtractor
5 Adaptive filter
6 LMS algorithm
7 Step size parameter (μ) adjustment unit
8 Band-pass filter
9 Voice intensity calculation unit
10 Utterance presence/absence determination unit
11 Voice input switch
12 Timer
13 Silence signal generation unit
14 Changeover switch
15 Output control unit
16 Personal setting reflection unit
17 Personal utterance setting information storage unit
18 Personal utterance information setting unit
19 Utterance information learning unit
20 Personal identification unit
21 Speech recognition engine

Claims

A speech recognition apparatus comprising:
a microphone that picks up the user's voice input to the speech recognition apparatus together with audio sound;
an adaptive filter that receives the audio signal from which the audio sound is output and changes its tap coefficients by an adaptive algorithm using a step size parameter; and
a subtractor that receives the output signal of the adaptive filter and the signal from the microphone,
the error signal between the two signals, output from the subtractor, being input to the adaptive algorithm and output to a speech recognition engine,
the speech recognition apparatus further comprising:
voice intensity calculating means for calculating the voice intensity of the error signal;
utterance presence/absence determining means for determining the presence or absence of an utterance from whether a preset predetermined time has elapsed since the voice intensity calculated by the voice intensity calculating means switched from at or above a preset threshold to below that threshold; and
step size parameter adjusting means for gradually increasing the step size parameter, which has been decreased in advance, when the utterance presence/absence determining means determines that there is no utterance.

The speech recognition apparatus according to claim 1, wherein the step size parameter adjusting means gradually decreases the step size parameter once a predetermined period has elapsed after the user presses the voice input switch.

The speech recognition apparatus according to claim 1, wherein the voice intensity threshold is learned and changed according to a speech recognition processing result.

The speech recognition apparatus according to claim 1, wherein the preset predetermined time, measured from when the voice intensity switches from at or above the preset threshold to below that threshold, is learned and changed according to a speech recognition processing result.

The speech recognition apparatus according to claim 2, wherein the predetermined period after the voice input switch is pressed is learned and changed according to a speech recognition processing result.

The speech recognition apparatus according to claim 1, wherein the utterance presence/absence determining means changes its determination criterion according to personal information held by a personal utterance information setting unit and preset for each user.

The speech recognition apparatus according to claim 1, further comprising output control means for outputting zero data to the speech recognition engine when the utterance presence/absence determining means determines that there is no utterance.