JP3163109B2

JP3163109B2 - Multi-directional simultaneous voice pickup speech recognition method

Info

Publication number: JP3163109B2
Application number: JP08664591A
Authority: JP
Inventors: 敬三木
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1991-04-18
Filing date: 1991-04-18
Publication date: 2001-05-08
Anticipated expiration: 2016-05-08
Also published as: JPH04318900A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to a multi-directional simultaneous sound pickup speech recognition method that simultaneously picks up and recognizes speech arriving from multiple directions.

[0002]

[Prior Art] Speech recognition devices have begun to be used as an effective input means for computers and various other equipment. FIG. 4 is a block diagram of a typical conventional recognition device, as disclosed in Japanese Patent Application Laid-Open No. Sho 62-73298, in which 401 is a speech input terminal, 402 is a speech analysis unit, 403 is a speech section detection unit, 404 is a comparison matching pattern memory unit, 405 is a similarity calculation unit, and 406 is a determination unit.

[0003] A speech signal input from the speech input terminal 401 is converted by the speech analysis unit 402 into a time series of feature vectors representing speech features.

[0004] The speech section detection unit 403 determines the section in which speech is present, that is, the speech section, based on the feature vectors from the speech analysis unit 402. The feature vector sequence from the speech start point to the speech end point is called a speech matching pattern.

[0005] Next, the function of the comparison matching pattern memory unit 404 will be described. In speaker-dependent speech recognition, which restricts the speaker, each word to be recognized (called a category) must be uttered in advance (called training speech), and the speech matching pattern obtained by applying the same speech analysis (called a comparison matching pattern) must be stored. The comparison pattern memory unit 404 stores such comparison matching patterns. The operation of storing the comparison matching patterns is called registration processing. In speech recognition that does not restrict the speaker (called speaker-independent speech recognition), comparison matching patterns of various speakers are stored in the comparison pattern memory unit 404 in advance.

[0006] The similarity calculation unit 405 calculates the similarity between the input matching pattern generated from the input speech to be recognized and each comparison matching pattern. Various methods have been proposed for this calculation, including the well-known DP matching and the simple linear matching disclosed in Japanese Patent Application Laid-Open No. Sho 62-73299, and the similarity is calculated by any appropriate one of them.

[0007] Using the per-category similarities output from the similarity calculation unit 405, the determination unit 406 outputs, as the recognition result, the category name assigned to the comparison matching pattern that gives the maximum similarity.

[0008]

[Problems to Be Solved by the Invention] Because the conventional recognition device described above picks up speech with a single microphone, it could always exhibit its full recognition capability in an input arrangement where the distance and direction between the vocal organs and the microphone are constant, as with a close-talking microphone. However, in an input arrangement where the distance and direction between the vocal organs and the microphone vary greatly, recognition performance often dropped sharply. Moreover, even with a close-talking microphone, recognition performance was still unstable in an environment where the surrounding background noise sources change greatly, and malfunctions and misrecognition could occur. An object of the present invention is to provide a speech recognition method that achieves stable, high recognition performance even in an environment where the distance and direction between the vocal organs and the microphone vary greatly and the background noise environment also changes greatly.

[0009]

[Means for Solving the Problems] To achieve this object, the present speech recognition method, in a first configuration, comprises a process of simultaneously picking up speech from a plurality of microphones to obtain a plurality of input signals, a process of performing speech identification on each of the input signals independently to obtain a plurality of identification results, and a process of making an integrated determination on these identification results. Methods of integrated determination include: when a majority of the inputs agree, determining the agreed identification result as the final identification result; determining as the final identification result the one with the highest identification accuracy among the inputs; or determining as the final identification result the one with the highest signal-to-noise ratio among the inputs.

[0010] In a second configuration, the present speech recognition method further comprises a process of determining, based on the recognition result confirmed by the above speech recognition method, the main input system having the best speech signal / background noise ratio among the input systems and the noise input system having the worst speech signal / background noise ratio, and a process of constructing an adaptive noise elimination filter from the main input system and the noise input system. In subsequent recognition processing, speech identification is performed on the input signal after adaptive noise elimination filtering.

[0011]

[Operation] In the first configuration, an identification operation is performed independently on each of a plurality of input signals picked up simultaneously from a plurality of microphones, and a recognition decision is made by having the integrated determination unit judge the resulting identification results and recognition auxiliary information as a whole. In the second configuration, an adaptive noise elimination filter is constructed based on the recognition result confirmed by the recognition decision of the first configuration, and subsequent recognition decisions are made on the input signal after this filtering. With these configurations, speech recognition can be performed on the input signal with the best signal-to-noise ratio, taking the position of the noise source into account. Therefore, the speech recognition method of the present invention makes it possible to realize a speech recognition device that achieves stable, high recognition performance even in an environment where the distance and direction between the vocal organs and the microphone vary greatly and the background noise environment also changes greatly.

[0012]

[Embodiments] Embodiments of the present invention are described below. Here, a speech matching pattern means a pattern produced by the generation process common to input matching patterns and comparison matching patterns.

[0013] FIG. 1 is a block diagram showing a first embodiment of the present invention applied to a speaker-dependent recognition system that performs a registration operation. In the figure, 101, 102, and 103 are the speech input terminals of the first, second, and third channels, respectively; 104, 105, and 106 are the speech analysis units of the first, second, and third channels; 107, 108, and 109 are the speech section detection units of the first, second, and third channels; 110, 111, and 112 are the comparison pattern memory units of the first, second, and third channels; 113, 114, and 115 are the speech identification units of the first, second, and third channels; 116 is an integrated determination unit; and 117 is an output terminal.

[0014] To simplify the description, the first embodiment shows an example with three lines of speech input signals, but a larger number of lines may be provided without any problem. Here, the functional blocks 101, 104, 107, 110, and 113 are collectively called the first-channel speech processing unit. Similarly, the functional blocks 102, 105, 108, 111, and 114 are collectively called the second-channel speech processing unit, and the functional blocks 103, 106, 109, 112, and 115 are collectively called the third-channel speech processing unit.

[0015] First, the identification processing of the speech signal input to the first channel is described. The speech signal input from the speech input terminal 101 is converted by the speech analysis unit 104 into a time series of feature vectors representing the features of the speech. Possible methods of deriving the feature vectors include one that uses a bank of bandpass filters whose center frequencies differ slightly from one another and one that uses spectral analysis by FFT (fast Fourier transform). Here, the method using a bandpass filter bank is taken as the example.

[0016] After analog-to-digital conversion in the speech analysis unit 104, the speech signal is passed through each bandpass filter, which extracts only its own frequency component. The data sequence separated out by each bandpass filter in this way is called a channel (a frequency channel, distinct from the first to third input channels above). The filter output of each channel is rectified to take its absolute value, and its average is computed frame by frame. This computed value is the magnitude of the feature vector for that channel in that frame. That is, writing A_ij for the magnitude in channel j of the i-th frame, the feature vector of the i-th frame is A_i = (A_i1, A_i2, ..., A_ip), where p is the number of channels.
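The rectify-and-average step of paragraph [0016] can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the band-pass filtering has already been done, and the function name `frame_features`, the argument layout, and the non-overlapping framing are all assumptions introduced here.

```python
def frame_features(band_outputs, frame_len):
    """Rectify each band-pass output and average it per frame.

    band_outputs: list of p equal-length sample sequences, one per
                  frequency channel (assumed already band-pass filtered).
    frame_len:    number of samples per (non-overlapping) frame; assumption.
    Returns one feature vector [A_i1, ..., A_ip] per frame i.
    """
    if not band_outputs:
        return []
    n_frames = len(band_outputs[0]) // frame_len
    features = []
    for i in range(n_frames):
        start = i * frame_len
        vector = []
        for samples in band_outputs:
            frame = samples[start:start + frame_len]
            # Rectification (absolute value) followed by the frame average.
            vector.append(sum(abs(x) for x in frame) / frame_len)
        features.append(vector)
    return features
```

Each inner list corresponds to one frame's feature vector, with one component per frequency channel.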

[0017] The speech section detection unit 107 determines the section in which speech is present, that is, the speech section, based on the feature vectors from the speech analysis unit 104.
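The patent does not say how the speech section is derived from the feature vectors. Purely as an illustration, the sketch below assumes a simple energy threshold: a frame counts as speech when the sum of its feature-vector components exceeds a threshold. The function name `detect_speech_section` and the threshold rule are hypothetical.

```python
def detect_speech_section(features, threshold):
    """Return (start_frame, end_frame) of the speech section, or None.

    features:  list of per-frame feature vectors.
    threshold: frame-energy threshold (hypothetical tuning parameter).
    A frame is treated as speech when the sum of its components exceeds
    the threshold; the section runs from the first to the last such frame.
    """
    active = [i for i, vec in enumerate(features) if sum(vec) > threshold]
    if not active:
        return None
    return active[0], active[-1]
```

The feature vectors between the returned start and end frames would then form the input matching pattern.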

[0018] The comparison pattern memory unit 110 stores the comparison matching patterns of the categories to be recognized, and the operation of storing these comparison matching patterns is called registration processing. As an example of registration processing, and to keep the description simple, the case where the training speech is uttered once per category is considered here. If the total number of categories is N, the comparison pattern memory stores N comparison matching patterns S1n (n = 1, 2, ..., N).

[0019] When speech matching patterns are compared with each other, the two must be aligned in time. A representative method of calculating the similarity between two patterns while finding the optimum alignment is the method commonly known as DP matching, as shown in Japanese Examined Patent Publication No. Sho 50-23941. The speech identification unit 113 calculates the similarity between patterns using such a DP matching method or another suitable method. That is, it obtains the similarity between the input matching pattern I1 generated from the input speech to be recognized and every comparison matching pattern S1n in the comparison pattern memory unit 110, yielding the speech feature similarities X1n.
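DP matching is commonly realized as dynamic time warping, so a minimal sketch of the similarity calculation might look like the following. The Euclidean local distance, the length normalization, the mapping 1 / (1 + distance) from distance to similarity, and the function names are assumptions for illustration, not details taken from the patent or from the cited publication.

```python
import math


def dp_matching_distance(input_pattern, reference_pattern):
    """Time-aligned distance between two feature-vector sequences (DTW)."""
    n, m = len(input_pattern), len(reference_pattern)
    if n == 0 or m == 0:
        raise ValueError("patterns must be non-empty")
    # cost[i][j]: best accumulated cost aligning the first i input frames
    # with the first j reference frames.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(input_pattern[i - 1], reference_pattern[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # input frame repeated
                                     cost[i][j - 1],      # reference frame repeated
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)  # length normalization (assumption)


def similarity(input_pattern, reference_pattern):
    """Map the distance to a similarity in (0, 1]; larger means more alike."""
    return 1.0 / (1.0 + dp_matching_distance(input_pattern, reference_pattern))
```

Under these assumptions, evaluating `similarity` against every stored comparison matching pattern yields one similarity value per category.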

[0020] Of these per-pattern speech feature similarities X1n, the category name C1 assigned to the comparison matching pattern giving the maximum value P1 is output to the integrated determination unit 116 as the speech identification result for that channel. In addition, the following are also output to the integrated determination unit 116 as identification auxiliary information: the maximum speech feature similarity P1 (called the identification accuracy), the speech start and end times VS1 and VE1, the ratio SN1 of the speech level (the signal level in the speech section) to the noise level (the signal level in the no-input state immediately before the speech section), called the signal/noise ratio, and so on.
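As a rough sketch of what one channel hands to the integrated determination unit, the code below bundles the identification result and auxiliary information into a record. The record type `ChannelResult`, the function `identify_channel`, and the treatment of a zero noise level are assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class ChannelResult:
    category: str    # C: category name with the maximum similarity
    accuracy: float  # P: the maximum similarity (identification accuracy)
    start: float     # VS: speech start time
    end: float       # VE: speech end time
    snr: float       # SN: speech level / noise level


def identify_channel(similarities, start, end, speech_level, noise_level):
    """Build one channel's result from its per-category similarities.

    similarities: dict mapping category name -> speech feature similarity.
    noise_level:  signal level just before the speech section; a level of
                  zero is mapped to an infinite ratio (assumption).
    """
    category = max(similarities, key=similarities.get)
    snr = speech_level / noise_level if noise_level > 0 else math.inf
    return ChannelResult(category, similarities[category], start, end, snr)
```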

[0021] The identification processing of the speech signal input to the second channel is performed in exactly the same way as for the first channel described above. That is, the speech signal input from the speech input terminal 102 is converted by the speech analysis unit 105 into a time series of feature vectors representing speech features. The speech section detection unit 108 determines the speech section based on the feature vectors from the speech analysis unit 105, and the feature vector sequence of that section, that is, the input matching pattern I2, is generated. Next, the speech identification unit 114 obtains the similarity between the input matching pattern I2 and every comparison matching pattern S2n in the comparison pattern memory unit 111, yielding the speech feature similarities X2n.

[0022] Of these per-pattern speech feature similarities X2n, the category name C2 assigned to the comparison matching pattern giving the maximum value is output to the integrated determination unit 116 as the speech identification result for that channel. In addition, the identification accuracy P2, the speech start and end times VS2 and VE2, the signal/noise ratio SN2, and so on are also output to the integrated determination unit 116 as identification auxiliary information.

[0023] The identification processing of the speech signal input to the third channel is likewise performed in exactly the same way as for the first channel described above, and the speech identification result C3 and the identification auxiliary information (identification accuracy P3, speech start and end times VS3 and VE3, signal/noise ratio SN3, and so on) are output to the integrated determination unit 116.

[0024] The integrated determination unit 116 makes an overall determination from the speech identification results and identification auxiliary information output by the speech processing units of the channels, and sends the final recognition result from the output terminal 117 to an external host or the like. Various determination methods are conceivable. Examples include: for the group of identification information whose speech start and end times are nearly the same, treating the identification result as valid only when a majority of the identification results agree; treating the identification result with the highest identification accuracy P as valid; and treating the identification result of the channel with the highest signal/noise ratio SN as valid. The final determination result may be computed with one of these methods or with another suitable method.
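The three determination methods named above can be sketched as follows. The patent does not define "nearly the same" start and end times, so the grouping rule used here (within a tolerance of the median start and end times) is an assumption, as are the record type, the function names, and the default tolerance.

```python
import statistics
from collections import Counter
from typing import NamedTuple


class ChannelResult(NamedTuple):
    category: str    # C
    accuracy: float  # P
    start: float     # VS
    end: float       # VE
    snr: float       # SN


def majority_decision(results, tolerance=0.1):
    """Category agreed on by a majority of the time-aligned results, else None."""
    if not results:
        return None
    mid_start = statistics.median(r.start for r in results)
    mid_end = statistics.median(r.end for r in results)
    # Keep only results whose start/end times are close to the medians
    # (hypothetical reading of "nearly the same" times).
    group = [r for r in results
             if abs(r.start - mid_start) <= tolerance
             and abs(r.end - mid_end) <= tolerance]
    if not group:
        return None
    category, votes = Counter(r.category for r in group).most_common(1)[0]
    return category if votes > len(group) / 2 else None


def best_accuracy_decision(results):
    """Category of the result with the highest identification accuracy P."""
    return max(results, key=lambda r: r.accuracy).category


def best_snr_decision(results):
    """Category of the channel with the highest signal/noise ratio SN."""
    return max(results, key=lambda r: r.snr).category
```

The three rules can disagree on the same inputs, which is why the choice among them (or a combination) is left open in the text.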

[0025] FIG. 2 is a block diagram showing a second embodiment of the present invention. In the second embodiment, the speech analysis unit 204, the speech section detection unit 207, and the speech identification unit 213 are operated in a time-division manner on the three speech input signals 201, 202, and 203, performing speech identification on each independently to obtain the speech identification results C1, C2, and C3 and the identification auxiliary information (identification accuracies P1, P2, P3; speech start times VS1, VS2, VS3; speech end times VE1, VE2, VE3; signal/noise ratios SN1, SN2, SN3). The comparison pattern memory unit 210 is shared. The processing of the integrated determination unit 216 is the same as in the first embodiment, and the final recognition result is sent from the output terminal 217 to an external host or the like.

[0026] FIG. 3 is a block diagram showing a third embodiment of the present invention. The operation of the third embodiment is divided into a noise-adaptive processing mode and a non-noise-adaptive processing mode. In the non-noise-adaptive processing mode, the adaptive noise elimination unit 318 is disconnected, and the recognition processing operates exactly as in the second embodiment. The device shifts to the noise-adaptive processing mode after speech recognition has been executed in the non-noise-adaptive processing mode. In the noise-adaptive processing mode, the following processing is performed prior to recognition.

[0027] The main input system and the noise input system are determined from the speech identification results C1, C2, and C3 of the immediately preceding recognition, the identification auxiliary information (identification accuracies P1, P2, P3; speech start times VS1, VS2, VS3; speech end times VE1, VE2, VE3; signal/noise ratios SN1, SN2, SN3), and the recognition result Cr after final confirmation (sent from the external host). The decision algorithm is as follows.

[0028] (1) Select as correct input systems the input systems that output a speech identification result matching the recognition result Cr after final confirmation. If there is no such input system, do not shift to the noise-adaptive mode, and perform the next recognition processing in the non-noise-adaptive mode as well.
(2) Among the correct input systems, take the one with the highest signal/noise ratio SN as the main input system.
(3) Among the input systems other than the main input system, take the one with the lowest signal/noise ratio SN as the noise input system.
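Steps (1) to (3) translate almost directly into code. The sketch below is one possible rendering; the function name `select_input_systems` and the use of list indices to identify input systems are assumptions, and ties in SN are broken by list order, which the patent does not specify.

```python
def select_input_systems(categories, snrs, confirmed):
    """Pick the main and noise input systems per steps (1) to (3).

    categories: identification result C_k of each input system k.
    snrs:       signal/noise ratio SN_k of each input system k.
    confirmed:  recognition result Cr after final confirmation.
    Returns (main_index, noise_index), or None when no input system
    matched Cr (stay in the non-noise-adaptive mode).
    """
    # (1) Correct input systems: those whose result matches Cr.
    correct = [k for k, c in enumerate(categories) if c == confirmed]
    if not correct:
        return None
    # (2) Main input system: highest SN among the correct ones.
    main = max(correct, key=lambda k: snrs[k])
    # (3) Noise input system: lowest SN among all the others.
    others = [k for k in range(len(categories)) if k != main]
    if not others:
        return None  # a single input system cannot supply a noise reference
    noise = min(others, key=lambda k: snrs[k])
    return main, noise
```

Note that the noise input system is chosen from all remaining input systems, not only the incorrect ones, which follows the wording of step (3).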

[0029] Next, recognition processing in the noise-adaptive processing mode is described. First, the adaptive noise elimination unit 318 performs adaptive noise elimination filtering using the input signal of the main input system and the input signal of the noise input system, generating a noise-eliminated speech signal. Speech identification is then performed on the noise-eliminated speech signal in the same way as in the first embodiment. That is, the noise-eliminated speech signal is converted by the speech analysis unit 304 into a time series of feature vectors representing its features. The speech section detection unit 307 determines the speech section based on the feature vectors from the speech analysis unit 304, and the feature vector sequence of that section, that is, the input matching pattern I, is generated. Next, the speech identification unit 313 obtains the similarity between the input matching pattern I and every comparison matching pattern Sn in the comparison pattern memory unit 310, yielding the speech feature similarities Xn. Of these per-pattern speech feature similarities Xn, the category name C assigned to the comparison matching pattern giving the maximum value is output to the integrated determination unit 316 as the speech identification result. The integrated determination unit 316 sends the category name C unchanged, as the final recognition result, from the output terminal 317 to an external host or the like.

[0030] After the recognition result has been confirmed, information on whether the recognition result was correct is received from the host. If the recognition result was wrong, the selection of the main input system and the noise input system is considered to have become inappropriate, for example because the noise environment changed, so the device returns to the non-noise-adaptive processing mode and selects the main input system and the noise input system again. Conversely, if the recognition result was correct, all settings are considered appropriate, so recognition continues in the noise-adaptive processing mode. Selecting the processing mode from the recognition result in this way is only one example; the user may instead select the processing mode. A method that decides whether to change the processing mode from the frequency of misrecognition is also conceivable. FIG. 5 shows the operation flow of the third embodiment described above.
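The mode switching of the third embodiment amounts to a two-state machine. A minimal sketch under that reading follows; the `Mode` enum, the function `next_mode`, and its parameters are names introduced here, and the sketch covers only the recognition-result-driven policy, not the user-selected or error-frequency variants mentioned in the text.

```python
from enum import Enum


class Mode(Enum):
    NON_ADAPTIVE = "non-noise-adaptive"
    ADAPTIVE = "noise-adaptive"


def next_mode(mode, recognition_correct, correct_input_found=True):
    """Mode for the next recognition, given the host's feedback.

    recognition_correct: host's verdict on the last recognition result
                         (used in the noise-adaptive mode).
    correct_input_found: whether some input system matched the confirmed
                         result Cr (used in the non-noise-adaptive mode).
    """
    if mode is Mode.NON_ADAPTIVE:
        # Shift to the adaptive mode only if a main/noise pair can be chosen.
        return Mode.ADAPTIVE if correct_input_found else Mode.NON_ADAPTIVE
    # In the adaptive mode, an error sends the device back to reselect inputs.
    return Mode.ADAPTIVE if recognition_correct else Mode.NON_ADAPTIVE
```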

[0031] The adaptive noise elimination filtering is well known from, for example, the tutorial "Application of Digital Filters in Microphone Systems" in the Journal of the Acoustical Society of Japan, Vol. 45, No. 2 (1989), and the references cited there, so its description is omitted here. However, adaptation is suspended from the time the speech section detection unit 307 detects the speech start point until it detects the speech end point (see FIG. 5), and the coefficients of the adaptive elimination filter in the adaptive noise elimination unit 318 are held fixed during that interval.
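The patent leaves the filter algorithm to the cited literature. As an illustration only, the sketch below uses a normalized LMS (NLMS) canceller, which is one common choice and an assumption here, and shows the one behaviour the text does specify: the coefficients are frozen while speech is present. The function name, the parameters `taps`, `mu`, and `eps`, and the per-sample `in_speech` flags are hypothetical.

```python
def adaptive_noise_cancel(primary, reference, in_speech, taps=4, mu=0.5, eps=1e-8):
    """NLMS noise canceller with adaptation frozen during speech.

    primary:   samples from the main input system (speech plus noise).
    reference: samples from the noise input system.
    in_speech: per-sample booleans, True between the detected speech
               start and end (adaptation is suspended there).
    Returns (noise-eliminated samples, final filter coefficients).
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    output = []
    for d, x, speaking in zip(primary, reference, in_speech):
        history = [x] + history[:-1]
        noise_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - noise_estimate  # this is the noise-eliminated sample
        output.append(error)
        if not speaking:
            # Normalized LMS update; skipped while speech is present so the
            # filter does not adapt to (and cancel) the speech itself.
            power = sum(h * h for h in history) + eps
            weights = [w + mu * error * h / power
                       for w, h in zip(weights, history)]
    return output, weights
```

Freezing the update during speech keeps the filter from treating the talker as noise, which is presumably why the text fixes the coefficients between the detected start and end points.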

[0032]

[Effects of the Invention] As described in detail above, according to the present invention, an identification operation is performed independently on each of the input signals picked up simultaneously from a plurality of microphones, and the recognition decision is made as a whole from the resulting recognition results. In addition, as shown in the third embodiment, an adaptive noise elimination filter is constructed based on the recognition result obtained in the non-noise-adaptive processing mode, and after the shift to the noise-adaptive processing mode the recognition decision is made on the noise-eliminated speech signal. This has the advantage that stable, high recognition performance is obtained even in an environment where the distance and direction between the vocal organs and the microphone vary greatly and the background noise environment also changes greatly. For example, when a speech recognition device is used in a place where the position and loudness of noise sources change irregularly, such as inside an automobile, in an office, in a factory, or on the street, performing the adaptive operation of the present invention can markedly improve recognition performance.

[Brief description of the drawings]

FIG. 1 is a block diagram showing a first embodiment of the present invention.

FIG. 2 is a block diagram showing a second embodiment of the present invention.

FIG. 3 is a block diagram showing a third embodiment of the present invention.

FIG. 4 is a block diagram of a conventional recognition device.

FIG. 5 is a diagram showing the operation flow of the third embodiment.

[Explanation of symbols]

101 speech input terminal of the first channel
102 speech input terminal of the second channel
103 speech input terminal of the third channel
104 speech analysis unit of the first channel
105 speech analysis unit of the second channel
106 speech analysis unit of the third channel
107 speech section detection unit of the first channel
108 speech section detection unit of the second channel
109 speech section detection unit of the third channel
110 comparison pattern memory unit of the first channel
111 comparison pattern memory unit of the second channel
112 comparison pattern memory unit of the third channel
113 speech identification unit of the first channel
114 speech identification unit of the second channel
115 speech identification unit of the third channel
116 integrated determination unit
117 output terminal

Continuation of the front page

(56) References
Japanese Patent Application Laid-Open No. Sho 59-23397 (JP, A)
Japanese Patent Application Laid-Open No. Hei 2-178699 (JP, A)
Japanese Patent Application Laid-Open No. Sho 58-143396 (JP, A)
Japanese Patent Application Laid-Open No. Sho 58-52696 (JP, A)
Japanese Patent Application Laid-Open No. Sho 60-166995 (JP, A)
Japanese Patent Application Laid-Open No. Sho 63-18400 (JP, A)
Japanese Patent Application Laid-Open No. Sho 61-35496 (JP, A)
Japanese Patent Application Laid-Open No. Sho 61-35495 (JP, A)
Japanese Patent Application Laid-Open No. Hei 4-240898 (JP, A)
Japanese Patent Application Laid-Open No. Hei 4-240897 (JP, A)
Japanese Patent Application Laid-Open No. Hei 4-199197 (JP, A)
Japanese Patent Application Laid-Open No. Hei 4-273298 (JP, A)
Japanese Patent Application Laid-Open No. Hei 4-212600 (JP, A)
Japanese Utility Model Application Laid-Open No. Sho 57-116999 (JP, U)
Japanese Utility Model Application Laid-Open No. Sho 57-69067 (JP, U)
Japanese Examined Utility Model Publication No. Hei 2-41680 (JP, Y2)
Journal of the Acoustical Society of Japan, Vol. 45, No. 2, Yutaka Kaneda, "Application of Digital Filters in Microphone Systems: Techniques for Removing Unwanted Sound", pp. 125-128 (published February 1989)

(58) Fields searched (Int. Cl.7, DB name) G10L 15/00 - 17/00

Claims

(57) [Claims]

1. A speech recognition method comprising: a process of simultaneously picking up speech from a plurality of microphones to obtain a plurality of input signals; a process of performing speech identification on each of the plurality of input signals independently to obtain a plurality of identification results; and a process of, when a majority of the plurality of identification results agree, determining the agreed identification result as the final identification result.

2. The speech recognition method according to claim 1, further comprising: a process of determining, based on the confirmed recognition result, a main input system having the best speech signal / background noise ratio among the input systems and a noise input system having the worst speech signal / background noise ratio; and a process of constructing an adaptive noise elimination filter from the main input system and the noise input system, wherein, in subsequent recognition processing, speech identification is performed on the output obtained by filtering the input signal with the adaptive noise elimination filter.

3. A speech recognition method comprising: a process of simultaneously picking up speech from a plurality of microphones to obtain a plurality of input signals; a process of performing speech identification on each of the plurality of input signals independently to obtain a plurality of identification results; a determination process of determining, among the plurality of identification results, the one with the maximum signal-to-noise ratio as the identification result; a process of determining a correct input system based on the recognition result confirmed by the determination process; a process of determining, when the correct input system has been determined, a main input system having the best speech signal / background noise ratio among the correct input systems and a noise input system having the worst speech signal / background noise ratio among the input systems; and a process of constructing an adaptive noise elimination filter from the main input system and the noise input system, wherein, in subsequent recognition processing, speech identification is performed on the output obtained by filtering the input signal with the adaptive noise elimination filter.