JP3530035B2

JP3530035B2 - Sound recognition device

Info

Publication number: JP3530035B2
Application number: JP23256798A
Authority: JP
Inventors: 弘行松井; 智大高野; 真理子青木; 学岡本; 茂明青木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1998-08-19
Filing date: 1998-08-19
Publication date: 2004-05-24
Anticipated expiration: 2018-08-19
Also published as: JP2000066698A

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a sound recognition device that, in an environment where multiple sound sources coexist, separates and extracts one or more target sound source signals and recognizes each target sound source according to the position and attributes of the source.

[0002]

[Description of the Related Art] Speech recognition devices generally suffer from ambient noise. The noise acts as a disturbance that degrades recognition performance, or the device responds falsely to the noise, so that the recognition rate drops sharply or recognition fails altogether. The device also malfunctions when several people speak at the same time, with the same result. Okuno et al. have reported experiments on simultaneous recognition of mixed speech from two speakers ("Proposal of a speech stream separation method and preliminary experiments on simultaneous recognition of multiple voices," Transactions of the Information Processing Society of Japan, Vol. 38, No. 3, pp. 510-522). In that work, each voice is separated by exploiting the characteristics of its harmonic structure, and a speech recognition device is connected to each separated signal to test simultaneous recognition of two speakers. However, this configuration needs a recognition processing unit for each voice, so the device grows as the number of voices increases. It also cannot recognize input in which sounds other than speech are mixed.

[0003] Some car navigation systems incorporate a speech recognition device. When such a system recognizes the driver's speech, the passengers' voices, cabin noise, car audio music, and the like act as disturbances, and recognition performance degrades markedly. In addition, the recognizer responds in the same way to speech from the driver's seat, the front passenger seat, and the rear seats, so the set of commands to be recognized cannot be changed per seat. It has therefore been impossible both to restrict the commands recognized from the driver's seat for safety and to widen the commands available from the front passenger seat for convenience.

[0004] As a result, current car navigation systems with a speech recognition function are limited to minor operations that do not burden the driver. This reflects consideration for the driver, who should concentrate on driving, but it greatly restricts convenience for passengers. Moreover, current speech recognition is designed to perform well when the target source sounds alone and there is almost no other sound (for example, a single utterance in a nearly noise-free environment). In real environments, where the voices of several people and various everyday environmental sounds are mixed, recognition performance degrades markedly or recognition fails. In other words, such systems cannot do what human hearing does when multiple sources are active (including simultaneously): recognize sources by position, recognize speech by speaker, or recognize sources after distinguishing speech from environmental sounds.

[0005] An object of the present invention is to provide a recognition device that, in such an environment where multiple sound sources are mixed, separates and extracts one or more target sound source signals and recognizes each target source according to the position and attributes of the source.

[0006]

[Means for Solving the Problems] In the invention of claim 1, sound source information is separated from multiple sound sources by sound source separation means, and the sound source information to be recognized, output from the sound source separation means, is stored by signal storage means. Using recognition dictionary storage means that stores recognition dictionaries for recognition processing, the sound source information output from the signal storage means is matched against a recognition dictionary from the recognition dictionary storage means, and the recognition result is output from recognition processing means. That is, recognition processing is performed for each separated and extracted sound source signal.

[0007] Sound source determination means determines, from each piece of separated sound source information output by the sound source separation means, the position of the source and its attributes (for example, which speaker, speech versus everyday environmental sound, the type of environmental sound, or which source). Recognition dictionary selection means selects the required dictionary from the recognition dictionaries stored in the recognition dictionary storage means. Recognition control means controls the recognition dictionary selection means according to the position and attributes determined by the sound source determination means, retrieves the recognition dictionary required for the determined source from the recognition dictionary storage means, has the recognition processing means match the retrieved dictionary against the sound source information stored in the signal storage means, and outputs the recognition result. That is, because the dictionary required for each determined source is selected, a single recognition processing means performs recognition on each separated and extracted source using the dictionary appropriate to it.

[0008] In the invention of claim 2, multiple pieces of sound source information are stored in the signal storage means, and storage switching selection means selects the input and output of the relevant sound source information to and from the signal storage means. The recognition control means controls the storage switching selection means according to the position and attributes determined by the sound source determination means, so that the sound source information of each source is stored in the signal storage means. It also controls the recognition dictionary selection means according to the position and attributes of each stored source to retrieve the required recognition dictionary from the recognition dictionary storage means, controls the storage switching selection means to read out the relevant stored sound source information, has the recognition processing means match the two, and outputs the recognition result. That is, even when several sources sound at the same time, each piece of separated and extracted sound source information is stored separately in the signal storage means, so each can be read out individually and recognized by a single recognition processing means using the dictionary selected for that source.

[0009]

[Operation] In the invention of claim 1, sound source information is separated from multiple sound sources by the sound source separation means, and the sound source information of the separated source to be recognized is stored by the signal storage means.

[0010] The sound source determination means then determines, from each piece of separated sound source information output by the sound source separation means, the position of the source and its attributes (for example, which speaker, speech versus everyday environmental sound, the type of environmental sound, or which source). The recognition dictionary selection means is controlled according to the determined position and attributes, the recognition dictionary required for the determined source is retrieved from the recognition dictionary storage means, and recognition is performed. Because recognition dictionaries are provided per determined source and the required dictionary is selected when each source is recognized, a single recognition processing means can recognize every source, which reduces the scale of the recognition hardware.

[0011] In the invention of claim 2, storage switching selection means switches the input and output of the relevant sound source information to and from the signal storage means. The storage switching selection means is controlled according to the position and attributes determined by the sound source determination means, so that the sound source information of each source is stored in the signal storage means. The storage switching selection means is then controlled again, and each piece of stored sound source information is recognized using the dictionary that the recognition dictionary selection means retrieves for that source from the recognition dictionary storage means. Because the signal storage means and the recognition dictionary storage means are organized per determined source, every source can be recognized even when several sound at the same time. That is, each piece of separated and extracted sound source information is stored separately in the signal storage means, so the pieces can be read out one by one and recognized by a single recognition processing means using the dictionary selected for each source, which reduces the scale of the recognition hardware.

[0012]

[Embodiments of the Invention] Proposed Example 1. FIG. 1 shows the functional configuration of the proposed example, and FIG. 4 shows its processing procedure. A signal S in which various sound source signals are mixed is captured (depending on the configuration of the sound source separation unit, the capture may use a single channel or multiple channels) and read in as an electrical signal (S02).

[0013] The sound source separation unit 1 separates the captured signal S into the sound source signals S_a, ..., S_n (S03). The separation can be realized, for example, by the methods described in "Examination of two-source separation methods focusing on inter-channel information," Proceedings of the Acoustical Society of Japan, 1-7-13, pp. 489-490 (September 1996), and "Noise suppression processing using differences in the sound field distribution of uttered speech," IEICE General Conference, D-14-16, p. 227 (March 1998).

[0014] The signals S_a, ..., S_n separated by the sound source separation unit 1 are each stored in the signal storage unit 2 (storage areas a to n correspond to the signals S_a to S_n) (S04). The signals S_a, ..., S_n stored in the signal storage unit 2 are each read out (S05), and the recognition processing unit 4 performs recognition matching on each of them using the dictionary information held in the recognition dictionary storage unit 3 (dictionaries a to n correspond to the signals S_a to S_n) (S06). The matching result is output from the recognition processing unit 4 (S07).

[0015] Embodiment 1. FIG. 2 shows the functional configuration of an embodiment of the invention of claim 1, and FIG. 5 shows its processing procedure. In FIG. 5, the processing from the start (S01) through (S03) is the same as in FIG. 4. The processing from (S04) onward in FIG. 5 is described below.

[0016] In FIG. 2, the sound source determination unit 5 determines, for each of the signals S_a, ..., S_n separated and extracted by the sound source separation unit 1, the position and attributes of the source (for example, which speaker, speech versus everyday environmental sound, the type of environmental sound, or which source) from the information in that signal (S04). The position can be determined, for example, by comparing the levels of the separated signals using the methods in the papers cited above, which gives the direction of each source and its distance (near or far, and so on). The attributes can be determined from differences in the sources' characteristics along the spectral axis or the time axis. For example, spectral patterns of the sources expected in the environment where the recognition device is used can be prepared in advance, and the signals S_a, ..., S_n can be classified by matching against those patterns.
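The two determinations described in this paragraph can be illustrated with a minimal Python sketch. It is not the method of the cited papers: the template spectra, the labels, the use of cosine similarity, and the rule "highest RMS level means nearest source" are all assumptions chosen only to make the idea concrete.

```python
import math

# Hypothetical reference spectra (five band energies per label), prepared in
# advance for the expected usage environment. Values are illustrative only.
TEMPLATES = {
    "speech": [0.1, 0.6, 0.8, 0.3, 0.1],
    "horn": [0.0, 0.1, 0.2, 0.9, 0.7],
    "road_noise": [0.9, 0.5, 0.2, 0.1, 0.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length band-energy vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_attribute(spectrum, templates=TEMPLATES):
    """Return the label whose template spectrum is most similar to the input."""
    return max(templates, key=lambda label: cosine_similarity(spectrum, templates[label]))


def rms_level(samples):
    """Root-mean-square level of a block of time-domain samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples)) if samples else 0.0


def nearest_source(signals):
    """Assumption: the separated signal with the highest level is the nearest."""
    return max(signals, key=lambda name: rms_level(signals[name]))
```

A call such as `classify_attribute([0.1, 0.5, 0.9, 0.2, 0.1])` picks the closest stored pattern, and `nearest_source` compares the levels of the separated signals S_a, ..., S_n. A practical system would likely combine several cues rather than rely on level alone.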

[0017] The signal storage unit 2 stores the recognition target signal S_i from among the signals S_a, ..., S_n (S05). The stored signal S_i is read out of the signal storage unit 2 (S06) and input to the recognition processing unit 4. The recognition dictionary selection unit 6 retrieves the dictionary corresponding to the signal S_i from the recognition dictionary storage unit 3 (dictionaries a to n correspond to the signals S_a to S_n) (S07), and recognition matching is performed (S08). The matching result is output from the recognition processing unit 4 via the recognition control unit 7 (S09).

[0018] To realize the above processing, the recognition control unit 7 performs the following three kinds of control according to the information from the sound source determination unit 5:
・control for storing the required signal S_i in the signal storage unit 2
・control for selecting the recognition dictionary that corresponds to the required stored signal S_i
・control for starting recognition processing and outputting the recognition result
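The three controls listed above can be sketched as a small Python class. This is only an illustration under stated assumptions: the class and method names, the two-element feature tuples, the dictionary contents, and the nearest-template matcher standing in for the recognition processing unit 4 are all hypothetical and do not come from the patent.

```python
# Hypothetical recognition dictionaries keyed by source attribute. Each maps a
# result label to a template feature vector (illustrative values only).
DICTIONARIES = {
    "speech": {"stop": (1.0, 0.0), "go": (0.0, 1.0)},
    "environmental": {"doorbell": (1.0, 0.1), "glass_break": (0.2, 0.9)},
}


def recognize(features, dictionary, max_distance=0.5):
    """Stand-in for the single recognition unit: nearest template within a limit."""
    best_label, best_dist = None, max_distance
    for label, template in dictionary.items():
        dist = sum((f - t) ** 2 for f, t in zip(features, template)) ** 0.5
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


class RecognitionController:
    """Sketch of the three controls performed by the recognition control unit."""

    def __init__(self, dictionaries):
        self.dictionaries = dictionaries
        self.stored = None  # plays the role of the signal storage unit

    def store(self, signals, target):
        # Control 1: keep only the required signal S_i.
        self.stored = (target, signals[target])

    def select_dictionary(self, attribute):
        # Control 2: choose the dictionary that matches the determined attribute.
        return self.dictionaries[attribute]

    def run(self, attributes):
        # Control 3: start recognition and return (output) the result.
        name, features = self.stored
        return name, recognize(features, self.select_dictionary(attributes[name]))
```

The point of the design is that `recognize` exists once; only the dictionary passed to it changes with the determined source, which is how one recognition unit can serve sources of different kinds.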

[0019] In the invention of claim 1, the sound source determination unit 5 determines, from each piece of separated sound source information output by the sound source separation unit 1, the position of the source and its attributes (for example, which speaker, speech versus everyday environmental sound, the type of environmental sound, or which source). The recognition dictionary selection unit 6 is controlled according to the determined position and attributes, the recognition dictionary required for the determined source is retrieved from the recognition dictionary storage unit 3, and recognition is performed. Because recognition dictionaries are provided per determined source and the required dictionary is selected when each source is recognized, a single recognition processing unit 4 can recognize every source, which reduces the scale of the recognition hardware. For example, when control is to follow a spoken word command arriving from a certain position, the recognition dictionary used for recognition is selected according to the determined source position. When everyday environmental sounds are to be recognized, a recognition dictionary is used that holds the standard patterns needed for those sounds, which differ from the patterns used for ordinary speech.

[0020] The description above assumes that the recognition dictionaries (a to n) in the recognition dictionary storage unit 3 are held in one large-capacity storage unit from which the appropriate dictionary is selected, but a separate recognition dictionary storage unit may instead be provided for each dictionary.

Embodiment 2. FIG. 3 shows the functional configuration of an embodiment of the invention of claim 2, and FIG. 6 shows its processing procedure. In FIG. 6, the processing from the start (S01) through (S04) is the same as in FIG. 5. The processing from (S05) onward in FIG. 6 is described below.

[0021] In this invention, the separated and extracted signals S_a, ..., S_n are each stored, via the storage switching selection unit 8, in the plurality of signal storage units 2 (storage areas a to n correspond to the signals S_a to S_n) (S05). Either all of the separated and extracted signals or only the required ones may be stored. Next, of the signals S_a, ..., S_n stored in the signal storage unit 2, the recognition target signal S_i is read out (S06) and input to the recognition processing unit 4 via the storage switching selection unit 8.

[0022] The recognition dictionary selection unit 6 retrieves the dictionary corresponding to S_i from the recognition dictionary storage unit 3 (dictionaries a to n correspond to the signals S_a to S_n) (S07), and recognition matching is performed (S08). The matching result is output from the recognition processing unit 4 via the recognition control unit 7 (S09). When there are multiple recognition targets, for example because several sources sounded at the same time, the procedure returns to (S06) after (S09), another recognition target signal is read out, and the processing from (S06) to (S09) is repeated (S10).
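The repetition of steps (S06) to (S09) over separately stored signals can be sketched in Python as follows. This is a simplified illustration: the function names, the modeling of a stored signal as a token string, and the dictionary contents in the usage note are assumptions, and `match` is only a placeholder for the recognition processing unit 4.

```python
def match(signal, dictionary):
    """Placeholder for the single recognition unit (hypothetical).

    The stored signal is modeled as a token and the dictionary as a mapping
    from token to result; a real recognizer would compare acoustic features.
    """
    return dictionary.get(signal)


def recognize_all(storage_areas, attributes, dictionaries):
    """Run S06 to S09 once per stored signal, reusing the one matcher above."""
    results = {}
    pending = list(storage_areas)                    # storage areas a..n to process
    while pending:                                   # S10: targets remain
        area = pending.pop(0)                        # S06: read out target S_i
        dictionary = dictionaries[attributes[area]]  # S07: select its dictionary
        # S08 and S09: perform matching and record the output for this source.
        results[area] = match(storage_areas[area], dictionary)
    return results
```

For instance, with `storage_areas = {"a": "tok1", "b": "tok9"}`, `attributes = {"a": "driver", "b": "passenger"}`, and one dictionary per attribute, the loop handles both simultaneously captured sources in turn with the same `match` function.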

[0023] To realize the above processing, the recognition control unit 7 performs the following three kinds of control according to the information from the sound source determination unit 5:
・control for selecting the corresponding storage area a to n of the signal storage unit 2 so that the signals S_a, ..., S_n are stored individually
・control for selecting the dictionary that corresponds to the required stored signal S_i
・control for starting recognition processing and outputting the recognition result

[0024] In the invention of claim 2, the storage switching selection unit 8 switches among the storage areas a to n of the signal storage unit 2. The storage switching selection unit 8 is controlled according to the position and attributes determined by the sound source determination unit 5, so that the sound source information of each source is stored in the corresponding storage area a to n of the signal storage unit 2. The storage switching selection unit 8 is then controlled again, and each source stored in the signal storage unit 2 is recognized using the dictionary that the recognition dictionary selection unit 6 retrieves for that source from the recognition dictionary storage unit 3. Because the signal storage unit 2 and the recognition dictionary storage unit 3 hold their contents per determined source, every source can be recognized even when several sound at the same time. That is, each piece of separated and extracted sound source information is stored separately in the signal storage unit 2, so the pieces can be read out one by one and recognized by a single recognition processing unit 4 using the dictionary selected for each source, which reduces the scale of the recognition hardware.

[0025] In the description above, the signal storage unit 2 and the recognition dictionary storage unit 3 are each configured as one large-capacity storage unit with a separate storage area per source, but a separate signal storage unit and recognition dictionary storage unit may instead be provided for each source.

[0026]

[Effects of the Invention] As described above, this invention makes it possible to provide a sound recognition device that, in an environment where multiple sound sources are mixed, separates and extracts one or more target sound source signals and performs recognition for each target source according to the position and attributes of the source. This allows a substantial improvement in recognition performance in environments where several sources are mixed and ambient noise is high. Speech uttered simultaneously by several people can also be separated into individual voices and recognized speaker by speaker.

[0027] FIG. 7 shows example results of a recognition experiment in which speech uttered simultaneously from two different positions was separated by position and then recognized. The results show that adopting this invention markedly improves the recognition rate for the target speech and that recognition with the sources separated is possible.

[0028] The invention of claim 1 selects the recognition dictionary required for each source when that source is recognized, so a single recognition processing unit can recognize every source. This reduces the scale of the recognition hardware and helps lower the cost of the device. In the invention of claim 2, the signal storage unit and the recognition dictionary storage unit hold their contents per determined source, so even when several sources sound at the same time, a single recognition processing unit can handle all of them. This greatly reduces the scale of the recognition hardware and helps lower the cost of the device.

[0029] As one application, a system can be provided that, inside a vehicle cabin for example, separates sources such as the driver's voice, each passenger's voice, car audio sound, and road noise, and recognizes them individually. The commands recognized from the driver's seat can be restricted for safety while the commands available from the front passenger seat and the rear seats are widened for convenience. The system can also attend only to the car audio sound, for example recognizing only specific traffic information and then storing it automatically or notifying the driver. It can attend only to road noise and, when the noise grows loud, automatically close the windows and turn on the air conditioning, or start a noise removal device. When a horn or similar sound comes from outside while the vehicle is moving, it can automatically lower the volume of the car audio to alert the driver.
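The in-vehicle behavior described above can be summarized as a small policy function. The following Python sketch is purely illustrative: the seat names, command strings, attribute labels, and action names are hypothetical, and the patent does not specify any particular command set or interface.

```python
# Hypothetical per-seat command sets: restricted for the driver, wider for
# passengers. None of these strings come from the patent.
ALLOWED_COMMANDS = {
    "driver_seat": {"volume up", "volume down"},
    "passenger_seat": {"volume up", "volume down", "set destination"},
    "rear_seat": {"volume up", "volume down", "set destination"},
}


def actions_for(position, attribute, recognized):
    """Map one recognition result to a list of (hypothetical) vehicle actions."""
    if attribute == "speech":
        # A spoken command is executed only if it is allowed for that seat.
        if recognized in ALLOWED_COMMANDS.get(position, set()):
            return ["execute:" + recognized]
        return []
    if attribute == "horn":
        # An external horn lowers the audio volume so the driver notices it.
        return ["lower_audio_volume", "warn_driver"]
    if attribute == "road_noise" and recognized == "loud":
        return ["close_windows", "air_conditioner_on"]
    return []
```

With this table, "set destination" spoken from the driver's seat is ignored while the same words from the passenger seat are executed, which is the safety and convenience trade-off the paragraph describes.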

[0030] As another application, for example in the home, the invention can be applied to voice control of various household appliances and to security systems that detect abnormal sounds in the home environment. A system that operates various devices by voice command can be provided. For example, household appliances that can be controlled only by a voice from a specific position, and applications in which the valid range of voice commands differs from speaker to speaker, are conceivable. A system that detects and reports abnormal sounds, such as those made by an intruder, is also conceivable.
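An appliance that responds only to a voice from a specific position could be sketched as a simple zone check on the estimated speaker position. The zone coordinates, the appliance names, and the function `handle_command` are hypothetical; the patent does not specify how the position constraint would be represented.

```python
import math

# Hypothetical allowed zones: appliance -> (center_x, center_y, radius) in meters
ALLOWED_ZONES = {
    "stove": (1.0, 0.5, 1.5),
    "tv": (4.0, 3.0, 3.0),
}


def in_zone(appliance, position):
    """True if the estimated speaker position lies inside the appliance's zone."""
    cx, cy, radius = ALLOWED_ZONES[appliance]
    return math.hypot(position[0] - cx, position[1] - cy) <= radius


def handle_command(appliance, command, position):
    """Execute a command only when it comes from the appliance's allowed zone."""
    if appliance not in ALLOWED_ZONES:
        return "unknown appliance"
    if not in_zone(appliance, position):
        return "rejected"
    return f"{appliance}: {command}"
```

In this sketch a command to the stove spoken from near the television is rejected, while the same command spoken next to the stove is passed on.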

[0031] In the above applications, each sound source is separated (sounds other than the target are suppressed or removed) before recognition, so recognition performance can be improved remarkably. As is realized in human hearing, this invention enables recognition of sound sources by position when a plurality of sound sources occur (including simultaneous occurrence), recognition of speech for each speaker, and recognition of sound sources with speech distinguished from environmental sound. It is not limited to the above applications and can be used in any industry as the auditory function of intelligent systems, robots, and the like.

[Brief description of drawings]

FIG. 1 is a diagram showing the functional configuration of a proposed example related to the present invention.

FIG. 2 is a diagram showing the functional configuration of a first embodiment of the present invention.

FIG. 3 is a diagram showing the functional configuration of a second embodiment of the present invention.

FIG. 4 is a flowchart explaining the operation of the proposed example related to the present invention.

FIG. 5 is a flowchart explaining the operation of the first embodiment of the present invention.

FIG. 6 is a flowchart explaining the operation of the second embodiment of the present invention.

FIG. 7 is a diagram showing an example of experimental results obtained using the present invention.

Continuation of front page

(72) Inventor: Manabu Okamoto, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(72) Inventor: Shigeaki Aoki, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(56) References: JP-A-8-314489 (JP, A); JP-A-63-56698 (JP, A); JP-A-9-33329 (JP, A); Okuno, Nakatani, Kawabata, "Proposal of a Speech Stream Separation Method and Preliminary Experiments on Simultaneous Recognition of Multiple Speech", Transactions of the Information Processing Society of Japan, Japan, March 15, 1997, Vol. 38, No. 3, pp. 510-523; Takeda, Itakura, "Improving Speech Recognition Performance by Sound Source Separation", Journal of the Acoustical Society of Japan, Japan, November 1, 1997, Vol. 53, No. 11, pp. 883-888
(58) Fields searched (Int. Cl.7, DB name): G10L 15/00 - 17/00; G10L 21/00 - 21/02

Claims

(57) [Claims]

1. A sound recognition device comprising: sound source separation means for separating sound source information from a plurality of sound sources; signal storage means for storing the sound source information of a recognition target output from the sound source separation means; recognition dictionary storage means for storing recognition dictionaries for recognition processing; recognition processing means for collating the sound source information output from the signal storage means with a recognition dictionary of the recognition dictionary storage means and outputting a recognition result; sound source determination means for determining the position of a sound source and the attribute of the sound source from the separated sound source information output from the sound source separation means; recognition dictionary selection means for selecting a dictionary from the recognition dictionaries stored in the recognition dictionary storage means; and recognition control means for controlling the recognition dictionary selection means according to the position of the sound source and the attribute of the sound source determined by the sound source determination means, extracting the recognition dictionary necessary for the determined sound source from the recognition dictionary storage means, and performing control so that the extracted recognition dictionary and the sound source information stored in the signal storage means are collated by the recognition processing means and a recognition result is output.

2. A sound recognition device comprising: sound source separation means for separating sound source information from a plurality of sound sources; signal storage means for storing a plurality of pieces of sound source information output from the sound source separation means; recognition dictionary storage means for storing recognition dictionaries for recognition processing; recognition processing means for collating the sound source information output from the signal storage means with a recognition dictionary of the recognition dictionary storage means and outputting a recognition result; sound source determination means for determining the position of a sound source and the attribute of the sound source from the separated sound source information output from the sound source separation means; recognition dictionary selection means for selecting a dictionary from the recognition dictionaries stored in the recognition dictionary storage means; recognition control means for controlling the recognition dictionary selection means according to the position of the sound source and the attribute of the sound source determined by the sound source determination means, extracting the recognition dictionary necessary for the determined sound source from the recognition dictionary storage means, and performing control so that the extracted recognition dictionary and the sound source information stored in the signal storage means are collated by the recognition processing means and a recognition result is output; and storage switching selection means for selecting the input and output of the corresponding sound source information to and from the signal storage means, wherein the recognition control means controls the storage switching selection means according to the position of the sound source and the attribute of the sound source determined by the sound source determination means so that the sound source information of the determined sound source is stored in the signal storage means, controls the recognition dictionary selection means according to the position of the sound source and the attribute of the sound source stored in the signal storage means so that the necessary recognition dictionary is taken out of the recognition dictionary storage means, and controls the storage switching selection means so that the corresponding sound source information stored in the signal storage means is collated by the recognition processing means and a recognition result is output.