JPH04318900A - Multidirectional simultaneous sound collection type voice recognizing method - Google Patents

Multidirectional simultaneous sound collection type voice recognizing method

Info

Publication number
JPH04318900A
JPH04318900A JP3086645A JP8664591A
Authority
JP
Japan
Prior art keywords
voice
identification
recognition
section
noise
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP3086645A
Other languages
Japanese (ja)
Other versions
JP3163109B2 (en)
Inventor
Takashi Miki
三木 敬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP08664591A priority Critical patent/JP3163109B2/en
Publication of JPH04318900A publication Critical patent/JPH04318900A/en
Application granted granted Critical
Publication of JP3163109B2 publication Critical patent/JP3163109B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Abstract

PURPOSE: To obtain stable recognition performance even in environments where the distance and direction between the vocal organs and a microphone change and the background noise environment changes. CONSTITUTION: Voices collected simultaneously through plural microphones are input from input terminals 101, 102, and 103 and passed through voice analysis parts 104, 105, and 106 and voice section detection parts 107, 108, and 109; comparison pattern memory parts 110, 111, and 112 are referred to, so that voice identification parts 113, 114, and 115 recognize the voices independently of one another. A total decision part 116 comprehensively judges the results of the independent recognition together with the auxiliary identification information (identification accuracy, start and end times of the voice, and signal-to-noise ratio) and outputs the final recognition result to an output terminal 117.

Description

[Detailed Description of the Invention]

[0001]

[Field of Industrial Application] The present invention relates to a multidirectional simultaneous sound-collection speech recognition method that picks up and recognizes speech from multiple directions at the same time.

[0002]

[Prior Art] Speech recognition devices are beginning to be used as a powerful input means for computers and various other devices. FIG. 4 is a block diagram of a typical conventional recognition device, as shown in Japanese Unexamined Patent Publication No. 62-73298, in which 401 is a voice input terminal, 402 a voice analysis section, 403 a voice section detection section, 404 a comparison matching pattern memory section, 405 a similarity calculation section, and 406 a determination section.

[0003] The voice signal input from the voice input terminal 401 is converted by the voice analysis section 402 into a time series of feature vectors representing the voice features.

[0004] The voice section detection section 403 determines the section in which voice is present, i.e., the voice section, based on the feature vectors from the voice analysis section 402. The feature vector sequence from the start of the voice to its end is called a voice matching pattern.

[0005] Next, the function of the comparison matching pattern memory section 404 will be explained. In speaker-dependent speech recognition, in which the speakers are limited, each word to be recognized (referred to as a category) must be uttered in advance (this utterance is called the learning voice) and subjected to the same voice analysis, and the resulting voice matching pattern (referred to as a comparison matching pattern) must be stored. The comparison pattern memory section 404 stores such comparison matching patterns; this storage operation is called registration processing. In speech recognition that does not limit the speakers (referred to as speaker-independent speech recognition), comparison matching patterns of various speakers are stored in the comparison pattern memory section 404 in advance.

[0006] The similarity calculation section 405 calculates the similarity between the input matching pattern generated from the input voice to be recognized and the comparison matching patterns. Various similarity calculation methods have been proposed, including the well-known DP matching and the simple linear matching shown in Japanese Unexamined Patent Publication No. 62-73299, and any suitable one of them may be used.

[0007] Using the similarity for each category output from the similarity calculation section 405, the determination section 406 outputs, as the recognition result, the category name assigned to the comparison matching pattern that gives the maximum similarity.

[0008]

[Problems to be Solved by the Invention] In the conventional recognition device described above, sound is picked up with a single microphone. With an input form in which the distance and direction between the vocal organs and the microphone are constant, such as a close-talking microphone, the recognition capability of the speech recognition device could always be exercised to the full. With input forms in which the distance and direction between the vocal organs and the microphone change greatly, however, recognition performance often deteriorated drastically. Moreover, even with a close-talking microphone, recognition performance was unstable in environments where the surrounding background noise sources changed greatly, and malfunctions and misrecognitions sometimes occurred. An object of the present invention is to provide a speech recognition method that obtains stable, high recognition performance even in environments where the distance and direction between the vocal organs and the microphone change greatly and the background noise environment changes greatly.

[0009]

[Means for Solving the Problems] To achieve this object, the present speech recognition method comprises, as a first configuration: a process of simultaneously picking up sound with a plurality of microphones to obtain a plurality of input signals; a process of performing voice identification on each of the plurality of input signals independently to obtain a plurality of identification results; and a process of making an integrated judgment on the plurality of identification results.

[0010] As a second configuration, the present speech recognition method further comprises: a process of determining, based on the recognition result established by the above speech recognition method, the main input system whose speech-signal-to-background-noise ratio is the best among the input systems and the noise input system whose ratio is the worst; and a process of constructing an adaptive noise removal filter from the main input system and the noise input system. In subsequent recognition processing, voice identification is performed on the input signal after the adaptive noise removal filtering.

[0011]

[Operation] In the first configuration, identification is performed independently on each of the plural input signals picked up simultaneously by the plural microphones, and the recognition judgment is made by evaluating the plural identification results and the auxiliary identification information comprehensively in an integrated determination section. In the second configuration, an adaptive noise removal filter is constructed based on the recognition result established by the first configuration, and subsequent recognition judgments are made on the input signal after this filtering. With these configurations, speech recognition can be performed on the input signal with the best signal-to-noise ratio, taking the position of the noise source into account. Therefore, the speech recognition method of the present invention realizes a speech recognition device that obtains stable, high recognition performance even in environments where the distance and direction between the vocal organs and the microphone change greatly and the background noise environment changes greatly.

[0012]

[Embodiments] Embodiments of the present invention are described below. Here, the term voice matching pattern means a pattern created by the generation process common to the input matching pattern and the comparison matching pattern.

[0013] FIG. 1 is a block diagram showing a first embodiment of the present invention, applied to a speaker-dependent recognition system that performs a registration operation. 101, 102, and 103 are the voice input terminals of the first, second, and third channels; 104, 105, and 106 are the voice analysis sections of the first, second, and third channels; 107, 108, and 109 are the voice section detection sections of the first, second, and third channels; 110, 111, and 112 are the comparison pattern memory sections of the first, second, and third channels; 113, 114, and 115 are the voice identification sections of the first, second, and third channels; 116 is the integrated determination section; and 117 is the output terminal.

[0014] In the first embodiment an example with three voice input lines is shown to simplify the explanation, but a larger number of lines may be provided without any problem. The functional blocks 101, 104, 107, 110, and 113 are collectively called the first-channel voice processing section; similarly, the functional blocks 102, 105, 108, 111, and 114 are collectively called the second-channel voice processing section, and the functional blocks 103, 106, 109, 112, and 115 the third-channel voice processing section.

[0015] First, the identification processing of the voice signal input to the first channel will be explained. The voice signal input from the voice input terminal 101 is converted by the voice analysis section 104 into a time series of feature vectors representing the characteristics of the voice. Conceivable methods for deriving the feature vectors include using a group of bandpass filters whose center frequencies differ slightly from one another, and spectral analysis by the FFT (fast Fourier transform); here, the method using a group of bandpass filters is taken as the example.

[0016] The voice signal is analog-to-digital converted in the voice analysis section 104, after which each bandpass filter extracts only its own frequency component. The data series sorted out by each bandpass filter in this way is called a channel. The output of the filter for each channel is rectified to take its absolute value, and the average is computed frame by frame; this value becomes the magnitude of that channel's component of the feature vector for the frame. That is, the feature vector Ai in the i-th frame is Ai = (Ai1, Ai2, ..., Aip), where p is the number of channels.
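
As a rough illustration of this analysis step, the sketch below computes frame-averaged band energies with a bandpass filter bank. The sampling rate, band edges, filter order, and frame length are illustrative assumptions; the patent does not specify them.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def band_energy_features(signal, fs=8000, band_edges=None, frame_len=256):
    """Per-frame band energies: filter, rectify (absolute value), and
    average within each frame, giving Ai = (Ai1, ..., Aip) per frame i.

    fs, band_edges, the filter order, and frame_len are assumptions;
    the patent does not specify them.
    """
    if band_edges is None:
        band_edges = np.linspace(200, 3400, 9)        # 8 channels
    bands = list(zip(band_edges[:-1], band_edges[1:]))
    n_frames = len(signal) // frame_len
    feats = np.empty((n_frames, len(bands)))
    for j, (lo, hi) in enumerate(bands):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        rectified = np.abs(sosfilt(sos, signal))      # rectify filter output
        for i in range(n_frames):
            feats[i, j] = rectified[i * frame_len:(i + 1) * frame_len].mean()
    return feats    # shape (n_frames, p): one feature vector per frame
```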

[0017] The voice section detection section 107 determines the section in which voice is present, i.e., the voice section, based on the feature vectors from the voice analysis section 104.

[0018] The comparison pattern memory section 110 stores the comparison matching patterns of the categories to be recognized; the operation of storing these comparison matching patterns is called registration processing. Here, to simplify the explanation, the registration process is illustrated by the case where the learning voice is uttered once per category. When the total number of categories is N, N comparison matching patterns S1n (n = 1, 2, ..., N) are stored in the comparison pattern memory.

[0019] When voice matching patterns are compared with each other, the two must be brought into temporal correspondence. A representative method of calculating the similarity between two patterns while finding the optimal correspondence is the so-called DP matching method shown in Japanese Examined Patent Publication No. 50-23941. The voice identification section 113 calculates the similarity between patterns using such a DP matching method or another suitable method; that is, it calculates the similarity between the input matching pattern I1 generated from the input voice to be recognized and every comparison matching pattern S1n in the comparison pattern memory section 110, obtaining the voice feature similarities X1n.

[0020] Of the voice feature similarities X1n obtained for the respective patterns, the category name C1 assigned to the comparison matching pattern giving the maximum value P1 is output to the integrated determination section 116 as the voice identification result for the channel. In addition, the following are output to the integrated determination section 116 as auxiliary identification information: the maximum similarity P1 (referred to as the identification accuracy), the voice start and end times VS1 and VE1, and the ratio SN1 (referred to as the signal/noise ratio) of the voice level (the signal level in the voice section) to the noise level (the signal level in the no-input state immediately before the voice section).
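
The signal/noise ratio defined here admits a simple realization, sketched below under the assumption that a per-frame level signal (e.g., the summed band energies) is available; the number of pre-speech frames averaged for the noise level is an assumed parameter.

```python
import numpy as np

def signal_noise_ratio(level, vs, ve, noise_len=10):
    """SN per paragraph [0020]: voice-section level over the level of the
    no-input state immediately before the voice section.

    `level` is a per-frame signal level; vs/ve are the detected voice
    start/end frames. noise_len is an assumed parameter.
    """
    voice_level = level[vs:ve].mean()
    pre = level[max(0, vs - noise_len):vs]
    noise_level = pre.mean() if pre.size else 1e-8
    return voice_level / max(noise_level, 1e-8)   # guard against silence
```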

[0021] The voice signal input to the second channel is identified in exactly the same way as described above for the first channel. That is, the voice signal input from the voice input terminal 102 is converted by the voice analysis section 105 into a time series of feature vectors representing the voice features; the voice section detection section 108 determines the voice section based on those feature vectors, and the feature vector sequence of that section, i.e., the input matching pattern I2, is generated. Next, the voice identification section 114 calculates the similarity between the input matching pattern I2 and every comparison matching pattern S2n in the comparison pattern memory section 111, obtaining the voice feature similarities X2n.

[0022] Of the voice feature similarities X2n for the respective patterns, the category name C2 assigned to the comparison matching pattern giving the maximum value is output to the integrated determination section 116 as the voice identification result for the channel, and the identification accuracy P2, the voice start and end times VS2 and VE2, the signal/noise ratio SN2, and so on are output as auxiliary identification information.

[0023] The voice signal input to the third channel is likewise identified in exactly the same way as the first channel, and the voice identification result C3 and the auxiliary identification information (identification accuracy P3, voice start and end times VS3 and VE3, signal/noise ratio SN3, etc.) are output to the integrated determination section 116.

[0024] The integrated determination section 116 makes a comprehensive judgment on the voice identification results and auxiliary identification information output from the voice processing sections of the respective channels, and sends the final recognition result from the output terminal 117 to an external host or the like. Various judgment methods are conceivable: for a group of identification results whose voice start and end times are nearly equal, accepting the identification result only when a majority of the results agree; accepting the identification result with the highest identification accuracy P; or accepting the identification result of the channel with the highest signal/noise ratio SN. The final judgment may be computed by any of the methods above or by another suitable method.
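
One assumed realization of this integrated judgment, combining the majority rule with the identification accuracy as a tie-breaker, might look as follows; the dictionary keys and the time tolerance are illustrative, not part of the patent.

```python
from collections import Counter

def integrated_decision(results, time_tolerance=0.1):
    """Integrated judgment over per-channel results.

    Each element of `results` is a dict with keys C (category name),
    P (identification accuracy), VS/VE (voice start/end times), and
    SN (signal/noise ratio) -- the auxiliary information of [0020].
    """
    # Keep only channels whose voice sections roughly coincide.
    ref = max(results, key=lambda r: r["SN"])
    group = [r for r in results
             if abs(r["VS"] - ref["VS"]) <= time_tolerance
             and abs(r["VE"] - ref["VE"]) <= time_tolerance]
    category, count = Counter(r["C"] for r in group).most_common(1)[0]
    if 2 * count > len(group):                    # strict majority agrees
        return category
    return max(group, key=lambda r: r["P"])["C"]  # else highest accuracy
```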

[0025] FIG. 2 is a block diagram showing a second embodiment of the present invention. In the second embodiment, the voice analysis section 204, the voice section detection section 207, and the voice identification section 213 are operated in time-division fashion on the three voice input signals 201, 202, and 203, so that voice identification is performed on each signal independently, yielding the voice identification results C1, C2, C3 and the auxiliary identification information (identification accuracies P1, P2, P3; voice start times VS1, VS2, VS3; voice end times VE1, VE2, VE3; signal/noise ratios SN1, SN2, SN3). The comparison pattern memory section 210 is shared. The processing of the integrated determination section 216 is the same as in the first embodiment, and the final recognition result is sent from the output terminal 217 to an external host or the like.

[0026] FIG. 3 is a block diagram showing a third embodiment of the present invention. The operation of the third embodiment is divided into a noise-adaptive processing mode and a non-noise-adaptive processing mode. In the non-noise-adaptive processing mode, the adaptive noise removal section 318 is bypassed, and the recognition processing operates exactly as in the second embodiment. The noise-adaptive processing mode is entered after speech recognition has been performed in the non-noise-adaptive processing mode; in it, the following processing is performed prior to recognition.

[0027] The main input system and the noise input system are determined from the voice identification results C1, C2, C3 of the immediately preceding recognition, the auxiliary identification information (identification accuracies P1, P2, P3; voice start times VS1, VS2, VS3; voice end times VE1, VE2, VE3; signal/noise ratios SN1, SN2, SN3), and the finally confirmed recognition result Cr (sent from the external host). The determination algorithm is as follows.

[0028] (1) Any input system whose voice identification result matches the finally confirmed recognition result Cr is selected as a correct input system. If there is no such input system, the noise-adaptive mode is not entered, and the next recognition is also performed in the non-noise-adaptive mode.
(2) Among the correct input systems, the one with the highest signal/noise ratio SN is taken as the main input system.
(3) Among the input systems other than the main input system, the one with the lowest signal/noise ratio SN is taken as the noise input system.
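
A direct transcription of rules (1) to (3) into code might look like this; the per-channel result records reuse the keys assumed in the earlier sketches.

```python
def select_input_systems(results, confirmed_cr):
    """Rules (1)-(3): pick the main and noise input systems.

    Returns (main, noise) channel indices, or None when rule (1) finds
    no correct system and the non-noise-adaptive mode must be kept.
    """
    correct = [i for i, r in enumerate(results) if r["C"] == confirmed_cr]
    if not correct:
        return None                                      # rule (1)
    main = max(correct, key=lambda i: results[i]["SN"])  # rule (2)
    others = [i for i in range(len(results)) if i != main]
    if not others:
        return None                                      # single channel
    noise = min(others, key=lambda i: results[i]["SN"])  # rule (3)
    return main, noise
```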

[0029] Next, the recognition processing in the noise-adaptive processing mode will be explained. First, the adaptive noise removal section 318 performs adaptive noise removal filtering on the input signal of the main input system and the input signal of the noise input system to generate a noise-removed voice signal. Thereafter, voice identification is performed on the noise-removed voice signal as in the first embodiment. That is, the noise-removed voice signal is converted by the voice analysis section 304 into a time series of feature vectors representing its features. The voice section detection section 307 determines the voice section based on the feature vectors from the voice analysis section 304, and the feature vector sequence of that section, i.e., the input matching pattern I, is generated. Next, the voice identification section 313 calculates the similarity between the input matching pattern I and every comparison matching pattern Sn in the comparison pattern memory section 310 to obtain the voice feature similarities Xn. Of these similarities Xn, the category name C assigned to the comparison matching pattern giving the maximum value is output to the integrated determination section 316 as the voice identification result. The integrated determination section 316 sends the category name C as it is from the output terminal 317 to an external host or the like as the final recognition result.

[0030] Furthermore, after the recognition result has been confirmed, correct/incorrect information on the recognition result is received from the host. If the recognition result is incorrect, the selection of the main input system and the noise input system is considered to have become inappropriate owing to a change in the noise environment or the like, so the device returns to the non-noise-adaptive processing mode and reselects the main input system and the noise input system. Conversely, if the recognition result is correct, all settings are considered appropriate, and recognition continues in the noise-adaptive processing mode. This selection of the processing mode based on the recognition result is only one example; the user may select the processing mode instead, or the decision to change the processing mode may be based on the frequency of misrecognitions. FIG. 5 shows the operation flow of the third embodiment described above.

[0031] The adaptive noise removal filtering process itself is well documented, for example in the course "Applications of Digital Filters in Microphone Systems" in the Journal of the Acoustical Society of Japan, Vol. 45, No. 2 (1989) and the references cited there, so its explanation is omitted here. Note, however, that the adaptation is halted from the moment the voice section detection section 307 detects the start of the voice until the end of the voice is detected (see FIG. 5); during this interval the coefficients of the adaptive removal filter in the adaptive noise removal section 318 are held fixed.
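
Although the patent defers the filter details to the literature, a common choice for this two-input arrangement is a (normalized) LMS adaptive canceller; the sketch below is one such assumed realization, with adaptation frozen inside detected voice sections as the text requires. The tap count and step size are illustrative.

```python
import numpy as np

def adaptive_noise_cancel(main_sig, noise_sig, in_speech,
                          n_taps=32, mu=0.5):
    """Two-input adaptive noise canceller (normalized LMS).

    The noise-input channel is filtered to predict the noise component
    of the main channel; the prediction error is the noise-removed
    voice signal. NLMS, n_taps, and mu are assumptions. `in_speech[k]`
    is True inside a detected voice section, where the coefficients
    are held fixed, as the patent requires.
    """
    w = np.zeros(n_taps)
    out = np.zeros(len(main_sig))
    for k in range(n_taps, len(main_sig)):
        x = noise_sig[k - n_taps:k][::-1]     # reference tap vector
        e = main_sig[k] - w @ x               # error = cleaned sample
        out[k] = e
        if not in_speech[k]:                  # adapt only outside speech
            w += mu * e * x / (x @ x + 1e-8)  # NLMS coefficient update
    return out
```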

[0032]

[Effects of the Invention] As explained in detail above, according to the present invention, input signals picked up simultaneously by a plurality of microphones are each identified independently and the recognition judgment is made comprehensively from the plural recognition results; furthermore, as shown in the third embodiment, an adaptive noise removal filter is constructed based on the recognition result obtained in the non-noise-adaptive processing mode, and after the transition to the noise-adaptive processing mode the recognition judgment is made on the noise-removed voice signal. This yields the advantage that stable, high recognition performance is obtained even in environments where the distance and direction between the vocal organs and the microphone change greatly and the background noise environment changes greatly. For example, when a speech recognition device is used where the position and magnitude of noise sources change irregularly, such as inside a car, in an office, in a factory, or on the street, the adaptive operation according to the present invention can improve recognition performance remarkably.
[Effects of the Invention] As explained in detail above, according to the present invention, the input signals picked up simultaneously from a plurality of microphones are each independently discriminated, and the plurality of recognition results are used to comprehensively perform the discrimination operation. In order to perform recognition judgment, and as shown in the third embodiment, an adaptive noise removal filter is configured based on the recognition result obtained in the no-noise adaptive processing mode, and after shifting to the noise-adaptive processing mode, the noise-removed voice is Since recognition is determined based on signals, the advantage is that stable and high recognition performance can be obtained even in environments where the distance and direction between the vocal organs and the microphone change significantly, and where the background noise environment changes significantly. . For example, when using a speech recognition device in a place where the position and size of the noise source change irregularly, such as in a car, office, factory, or on the street, the adaptive operation according to the present invention can significantly improve recognition performance. can be improved.

[Brief Description of the Drawings]

[FIG. 1] A block diagram showing the first embodiment of the present invention.

[FIG. 2] A block diagram showing the second embodiment of the present invention.

[FIG. 3] A block diagram showing the third embodiment of the present invention.

[FIG. 4] A block diagram of a conventional recognition device.

[FIG. 5] A diagram showing the operation flow of the third embodiment.

[Explanation of Symbols]

101 First-channel voice input terminal
102 Second-channel voice input terminal
103 Third-channel voice input terminal
104 First-channel voice analysis section
105 Second-channel voice analysis section
106 Third-channel voice analysis section
107 First-channel voice section detection section
108 Second-channel voice section detection section
109 Third-channel voice section detection section
110 First-channel comparison pattern memory section
111 Second-channel comparison pattern memory section
112 Third-channel comparison pattern memory section
113 First-channel voice identification section
114 Second-channel voice identification section
115 Third-channel voice identification section
116 Integrated determination section
117 Output terminal

Claims (2)

[Claims]
[Claim 1] A speech recognition method comprising: a process of simultaneously picking up sound with a plurality of microphones to obtain a plurality of input signals; a process of performing voice identification on each of the plurality of input signals independently to obtain a plurality of identification results; and a process of making an integrated judgment on the plurality of identification results.
[Claim 2] The speech recognition method according to claim 1, further comprising: a process of determining, based on the recognition result established by the speech recognition method of claim 1, the main input system whose speech-signal-to-background-noise ratio is the best among the input systems and the noise input system whose ratio is the worst; and a process of constructing an adaptive noise removal filter from the main input system and the noise input system, wherein subsequent recognition processing performs voice identification on the input signal after the adaptive noise removal filtering.
JP08664591A 1991-04-18 1991-04-18 Multi-directional simultaneous voice pickup speech recognition method Expired - Lifetime JP3163109B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP08664591A JP3163109B2 (en) 1991-04-18 1991-04-18 Multi-directional simultaneous voice pickup speech recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP08664591A JP3163109B2 (en) 1991-04-18 1991-04-18 Multi-directional simultaneous voice pickup speech recognition method

Publications (2)

Publication Number Publication Date
JPH04318900A true JPH04318900A (en) 1992-11-10
JP3163109B2 JP3163109B2 (en) 2001-05-08

Family

ID=13892769

Family Applications (1)

Application Number Title Priority Date Filing Date
JP08664591A Expired - Lifetime JP3163109B2 (en) 1991-04-18 1991-04-18 Multi-directional simultaneous voice pickup speech recognition method

Country Status (1)

Country Link
JP (1) JP3163109B2 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3510458B2 (en) 1997-09-05 2004-03-29 沖電気工業株式会社 Speech recognition system and recording medium recording speech recognition control program
WO2009145192A1 (en) * 2008-05-28 2009-12-03 日本電気株式会社 Voice detection device, voice detection method, voice detection program, and recording medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08263258A (en) * 1995-03-23 1996-10-11 Hitachi Ltd Input device, input method, information processing system and management method for input information
US7050974B1 (en) * 1999-09-14 2006-05-23 Canon Kabushiki Kaisha Environment adaptation for speech recognition in a speech communication system
JP2006171077A (en) * 2004-12-13 2006-06-29 Nissan Motor Co Ltd Device and method for voice recognition
JP4608670B2 (en) * 2004-12-13 2011-01-12 日産自動車株式会社 Speech recognition apparatus and speech recognition method
KR100855592B1 (en) * 2007-01-11 2008-09-01 (주)에이치씨아이랩 Apparatus and method for robust speech recognition of speaker distance character
JP2010066360A (en) * 2008-09-09 2010-03-25 Hitachi Ltd Distributed speech recognition system
JP2012145636A (en) * 2011-01-07 2012-08-02 Mitsubishi Electric Corp Speech recognizer and speech recognizing method
JP2016536626A (en) * 2013-09-27 2016-11-24 アマゾン テクノロジーズ インコーポレイテッド Speech recognition with multi-directional decoding
JP2021015084A (en) * 2019-07-16 2021-02-12 Kddi株式会社 Sound source localization device and sound source localization method

Also Published As

Publication number Publication date
JP3163109B2 (en) 2001-05-08

Similar Documents

Publication Publication Date Title
US7447634B2 (en) Speech recognizing apparatus having optimal phoneme series comparing unit and speech recognizing method
US5651094A (en) Acoustic category mean value calculating apparatus and adaptation apparatus
CA2366892C (en) Method and apparatus for speaker recognition using a speaker dependent transform
JP2768274B2 (en) Voice recognition device
JPH02238495A (en) Time series signal recognizing device
JPH11133992A (en) Feature extracting device and feature extracting method, and pattern recognizing device and pattern recognizing method
US5758021A (en) Speech recognition combining dynamic programming and neural network techniques
JP3163109B2 (en) Multi-directional simultaneous voice pickup speech recognition method
JPH0683388A (en) Speech recognition device
Al-Karawi Mitigate the reverberation effect on the speaker verification performance using different methods
Seltzer et al. Speech recognizer-based microphone array processing for robust hands-free speech recognition
KR20030010432A (en) Apparatus for speech recognition in noisy environment
JP2002123286A (en) Voice recognizing method
US5828998A (en) Identification-function calculator, identification-function calculating method, identification unit, identification method, and speech recognition system
JPS63502304A (en) Frame comparison method for language recognition in high noise environments
JP3437492B2 (en) Voice recognition method and apparatus
JP2004318026A (en) Security pet robot and signal processing method related to the device
CN112530452A (en) Post-filtering compensation method, device and system
JPH04324499A (en) Speech recognition device
CN110675890A (en) Audio signal processing device and audio signal processing method
JPH0442299A (en) Sound block detector
JP2975808B2 (en) Voice recognition device
JPH0316038B2 (en)
Seltzer et al. Parameter sharing in subband likelihood-maximizing beamforming for speech recognition using microphone arrays
JP3704080B2 (en) Speech recognition method, speech recognition apparatus, and speech recognition program

Legal Events

Date Code Title Description
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20010213

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090223

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100223

Year of fee payment: 9

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110223

Year of fee payment: 10

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120223

Year of fee payment: 11

EXPY Cancellation because of completion of term