JP2005354223A

JP2005354223A - Sound source information processing apparatus, sound source information processing method, and sound source information processing program

Info

Publication number: JP2005354223A
Application number: JP2004170429A
Authority: JP
Inventors: Masahide Arisei; 政秀蟻生
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-06-08
Filing date: 2004-06-08
Publication date: 2005-12-22

Abstract

PROBLEM TO BE SOLVED: To provide a highly practical sound source information processing apparatus capable of more reliable processing, and of processing matched to the purpose, by taking the surrounding environment into consideration, and to provide a sound source information processing method and a sound source information processing program.

SOLUTION: The sound source information processing apparatus includes a plurality of sound collection means (101-1 to 101-N) for collecting sound signals of a plurality of sound sources; a sound source estimating means 104 for estimating the direction of at least one sound source on the basis of the sound signals collected by the plurality of sound collection means; a sound source information extracting means 108 for extracting sound source information on a target sound source from among the sound sources in the estimated directions; a sound source state information acquiring means 109 for acquiring sound source state information, relating to at least one of one or more sound sources and the surrounding situation, for selecting the processing to be applied to the extracted sound source information; and a processing means 110 for applying prescribed processing to the extracted sound source information on the basis of the sound source state information.

COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to a sound source information processing apparatus, a sound source information processing method, and a sound source information processing program for appropriately processing sound source information, estimated using input signals from a plurality of sound collecting means, in accordance with the purpose and the situation.

Conventionally, various microphone array techniques using a plurality of microphones have been proposed as methods of collecting only a target sound in an environment with ambient noise (see, for example, Non-Patent Document 1). These techniques mainly identify the target sound among the signals input from the microphones, localize its source direction, and apply filtering whose directivity is matched to the localized direction, thereby collecting the target sound. In this document, the problem of extracting information on a target sound source when a plurality of sound sources may be present is called the sound source separation problem. Depending on the purpose, the term may refer only to localization of the sound source direction, or it may extend to extraction of the audio information from the sound source.

In the sound source separation problem, several factors strongly affect separation in real environments. The main ones are the relationship between the number of sound sources and the number of microphones, noise and reverberation, and movement of the sound sources. To address these factors, one proposal provides means for obtaining directivity in a desired direction in a short time, and means for obtaining directivity even when the sound source direction changes over time (see, for example, Patent Document 1). Another deals with a method for stably detecting the direction of a target sound source and suppressing noise with a small amount of computation (see, for example, Patent Document 2).

Patent Document 1: Japanese Patent Laid-Open No. 2002-186084
Patent Document 2: Japanese Patent Laid-Open No. 10-207490
Non-Patent Document 1: Yutaka Kaneda, "Microphone array technology for speech recognition under noise," Journal of the Acoustical Society of Japan, Vol. 53, No. 11 (1997), pp. 872-876

Effective methods and means have thus been proposed for individual factors that cause problems in real environments. In practice, however, these causes occur in combination, so the conditions assumed by a proposed method may be violated. In particular, when adaptive signal processing is used as the solution, the problem is generally formulated as an optimization under certain constraints, so the estimation itself may fail in an environment where the constraints cannot be satisfied. In such a case, besides failing to deliver sufficient performance, the sound source separation technique may well behave unexpectedly and end up being of no use at all. In other words, depending on the purpose and the state of the surrounding environment, it is considered more appropriate in actual use to modify the information on the estimated sound source, and the conventional techniques did not necessarily pay attention to this point.

That is, as described above, in an environment where a sound source information processing system that handles the sound source separation problem is actually used, the sound source estimation method itself may be unable to cope adequately, or the surrounding environment may dictate the behavior expected of the system. At present, therefore, a highly practical sound source information processing system that takes the surrounding environment into account has not yet been established.

The present invention has been made in view of the above, and its object is to obtain a highly practical sound source information processing apparatus, sound source information processing method, and sound source information processing program capable of more reliable processing that takes the surrounding environment into account.

In order to solve the above-described problems and achieve the object, a sound source information processing apparatus according to the present invention comprises: a plurality of sound collecting means for collecting sound signals of a plurality of sound sources; sound source estimation means for estimating the direction of at least one sound source based on the sound signals collected by the plurality of sound collecting means; sound source information extraction means for extracting sound source information on a target sound source from among the sound sources in the estimated directions; sound source status information acquisition means for acquiring sound source status information, relating to at least one of one or more sound sources and the surrounding situation, for selecting the processing to be applied to the extracted sound source information; and processing means for applying predetermined processing to the extracted sound source information based on the sound source status information.

A sound source information processing method according to the present invention comprises: a sound collection step of collecting sound signals of a plurality of sound sources; a sound source estimation step of estimating the direction of at least one sound source based on the sound signals collected in the sound collection step; a sound source information extraction step of extracting sound source information on the output target from among the sound sources in the estimated directions; a sound source status information acquisition step of acquiring sound source status information, relating to at least one of one or more sound sources and the surrounding situation, for selecting the processing to be applied to the extracted sound source information; and a processing step of applying predetermined processing to the extracted sound source information based on the sound source status information.

A sound source information processing program according to the present invention causes a computer to execute: a sound collection step of collecting sound signals of a plurality of sound sources; a sound source estimation step of estimating the direction of at least one sound source based on the sound signals collected in the sound collection step; a sound source information extraction step of extracting sound source information on the output target from among the sound sources in the estimated directions; a sound source status information acquisition step of acquiring sound source status information, relating to at least one of one or more sound sources and the surrounding situation, for selecting the processing to be applied to the extracted sound source information; and a processing step of applying predetermined processing to the extracted sound source information based on the sound source status information.

According to the present invention, appropriate processing is applied, based on a predetermined criterion, to the sound source information to be output that was estimated by the sound source estimation means. Specifically, the sound source status information acquisition means acquires predetermined sound source status information, and based on it, the predetermined processing to be applied to the estimated sound source information (the sound source information to be output) is selected and applied. The sound source information output in the present invention is therefore information that has been processed appropriately for the purpose and the surrounding situation.

The sound source status information is not directly included in the final output. It is extracted in order to learn about the sound source and its surroundings, that is, to be used in deciding which processing the processing means selects. It may be obtained from the sound source information, or it may be information other than audio information. Based on this sound source status information, the sound source information to be output is processed appropriately for the purpose and the surrounding situation.

According to the present invention, the sound source information to be output is processed appropriately for the purpose and the surrounding situation. This allows more reliable processing that follows the surrounding situation more closely, and yields a highly practical sound source information processing apparatus, sound source information processing method, and sound source information processing program.

Hereinafter, a sound source information processing apparatus, a sound source information processing method, and a sound source information processing program according to the present invention will be described in detail with reference to the drawings. The present invention is not limited to the following description and can be modified as appropriate without departing from the gist of the invention.

First, the basic concept of the present invention will be described. A sound source information processing apparatus according to the present invention comprises: a plurality of sound collecting means for collecting sound signals of a plurality of sound sources; sound source estimation means for estimating the direction of at least one sound source based on the sound signals collected by the plurality of sound collecting means; sound source information extraction means for extracting sound source information on a target sound source from among the sound sources in the estimated directions; sound source status information acquisition means for acquiring sound source status information, relating to at least one of one or more sound sources and the surrounding situation, for selecting the processing to be applied to the extracted sound source information; and processing means for applying predetermined processing to the extracted sound source information based on the sound source status information.

Here, the sound source estimation means typically has functions such as estimating the direction of one or more sound sources and emphasizing and outputting the sound from an estimated direction. It can be realized by, for example, a microphone array.

The sound source information extracted by the sound source information extraction means is the information related to the output of the sound source information processing apparatus, that is, the information the apparatus is intended to output. For example, when sound is collected by two microphones and the apparatus outputs only the sound from the direction of arrival, the sound source information is the sound emphasized in the estimated direction of arrival. When the apparatus is applied to a robot or a response system and outputs only the direction from which it is being spoken to, the sound source information is the angle of the estimated direction of arrival.

The sound source status information acquired by the sound source status information acquisition means, which is used to select the processing applied to the extracted sound source information, is not directly included in the final output of the apparatus. It is extracted in order to learn about the sound source and its surroundings, that is, for use in the decision made by the processing means. For example, when the sound source estimation means estimates a plurality of sound sources and the apparatus outputs the sound of one of them, the volume of each estimated sound source can be used as sound source status information. This volume is easily obtained from the sound sources estimated by the sound source estimation means.
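As an illustration of using volume as sound source status information, the following minimal sketch computes an RMS level in decibels for a block of samples. The function name `rms_level_db`, the decibel scale, and the `floor` parameter are assumptions made for this example; the source only says that the volume of each estimated source is used.

```python
import math

def rms_level_db(samples, floor=1e-12):
    """Return the RMS level of a block of samples in dB relative to 1.0.

    `floor` (an assumed parameter) avoids log10(0) on silent or empty input.
    """
    if not samples:
        return 20.0 * math.log10(floor)
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(max(rms, floor))

# Two hypothetical estimated sources: one loud, one quiet.
loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.05, -0.05, 0.05, -0.05]
levels = [rms_level_db(loud), rms_level_db(quiet)]
```

Comparing the entries of `levels` gives the kind of per-source volume information that a later decision step could consume.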

As another example, when the sound source estimation means estimates a plurality of sound sources and the apparatus outputs the sound of one of them, information on how likely each estimated source is to be a human voice can be extracted and used as sound source status information. This voice-likeness information can be extracted with existing techniques, such as counting waveform zero crossings or estimating harmonic structure.
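One of the features mentioned above, the zero-crossing count, can be sketched as follows. This is only the feature computation; how a real system would map it to a voice-likeness score is not specified in the source, and the helper names `zero_crossing_rate` and `sine` are hypothetical.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def sine(freq_hz, sample_rate=8000, n=800):
    """Test tone used only to exercise the feature."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

low_tone_zcr = zero_crossing_rate(sine(100))    # few crossings per sample
high_tone_zcr = zero_crossing_rate(sine(3000))  # many crossings per sample
```

A low-frequency tone yields a lower rate than a high-frequency one, so the value can serve as one coarse input among several when judging whether a source resembles speech.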

As a further example, when the sound source estimation means estimates a plurality of sound sources and the apparatus outputs the sound of one of them, speaker authentication can be performed on each estimated sound and the result used as sound source status information. Existing techniques can be used for the authentication, such as using a per-speaker HMM (hidden Markov model) score.

Based on the sound source status information acquired by the sound source status information acquisition means, the processing means determines and selects what processing to apply to the sound source information extracted by the sound source information extraction means, and applies the predetermined processing to that information according to the result. That is, the processing means takes the extracted sound source information as its input and applies the predetermined processing to it. The sound source information processed in this way becomes the final output. Applying no processing at all to the input may also be treated as one form of the predetermined processing, in which case the information input to the processing means is output unchanged.

The apparatus may also include holding means that holds the processing to be applied to the extracted sound source information according to the sound source status information, and the processing means may call up a predetermined processing procedure from the holding means based on the sound source status information and execute it. The holding means stores, as a table, lookup or tree-structured information, for example "apply this processing when the input is at or above a threshold, and another processing when it is below the threshold". A tree structure here means decision conditions arranged hierarchically, for example "Male?" → "Age?" → and so on. Using this table, the processing means can associate its inputs with its outputs.
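The table-driven selection described above can be sketched as a small rule list. This is an illustrative sketch, not the patent's implementation: the names `RULES`, `select_procedure`, `pass_through`, and `fall_back_to_default`, the 6.0 threshold, and the dictionary shape of the sound source information are all assumptions.

```python
def pass_through(info):
    """Procedure that outputs the extracted information unchanged."""
    return info

def fall_back_to_default(info):
    """Procedure that replaces the estimate with an assumed default direction."""
    return {"direction_deg": 0.0}

# Hypothetical rule table held by the "holding means":
# (minimum status value, procedure), ordered from highest threshold down.
RULES = [
    (6.0, pass_through),
    (float("-inf"), fall_back_to_default),
]

def select_procedure(status_value, rules=RULES):
    """Return the first procedure whose threshold the status value meets."""
    for minimum, procedure in rules:
        if status_value >= minimum:
            return procedure
    raise ValueError("no rule matched")

extracted = {"direction_deg": 30.0}
output = select_procedure(2.5)(extracted)  # below threshold: default is used
```

Keeping the rules as data rather than as nested conditionals makes it straightforward to swap thresholds or procedures per deployment, which appears to be the intent of holding the processing in a table.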

As an example of the decision made by the processing means, when the volume of each estimated sound source is used as sound source status information, the processing can focus on those volumes as follows. If the volume of the estimated sound source that is about to be output exceeds the volumes of the other sources by at least a fixed value, the environment is regarded as favorable for sound source estimation and the information on the estimated source is output as is. Otherwise, sound source estimation is regarded as difficult and only the sound source information for a predefined default (initial-value) direction is output. With this processing, the estimate is used unchanged when estimation is easy, and when estimation is difficult the system as a whole still behaves reliably even if the estimate is wrong.
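The volume-margin rule can be written as a short function. This is a sketch under stated assumptions: levels are taken to be in dB, and the 6 dB margin and the 0-degree default direction are placeholder values not given in the source.

```python
def choose_direction_by_volume(target_db, other_dbs, estimated_dir_deg,
                               default_dir_deg=0.0, margin_db=6.0):
    """Keep the estimated direction only if the target source is louder than
    every other source by at least `margin_db`; otherwise use the default."""
    if all(target_db - other >= margin_db for other in other_dbs):
        return estimated_dir_deg
    return default_dir_deg

# Target clearly dominates: the estimate is trusted.
easy = choose_direction_by_volume(-10.0, [-20.0, -18.0], 30.0)
# A competing source is nearly as loud: fall back to the default direction.
hard = choose_direction_by_volume(-10.0, [-12.0], 30.0)
```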

When the voice-likeness of the estimated sounds is used as sound source status information, processing such as the following can be applied to the input. Assume that voice-likeness information can be stored for each candidate for a certain period. If there is only one candidate whose voice-likeness exceeds a predetermined threshold and stays above it for longer than a predetermined time threshold, that candidate is output even if it is not the first candidate of the sound source estimation means.

On the other hand, if at least a certain number of sound sources have a voice-likeness above the predetermined threshold, only the sound source information for the default (initial-value) direction is output. If neither condition applies, the first candidate of the sound source estimation means is output. With this processing, human speech can be captured as far as possible even when the sound source estimation means estimates incorrectly. When there are several sources that sound like human voices, the apparatus does not force an estimate of a single sound but performs a specific action (for example, outputting only the sound source information for the predefined default direction). This gives the behavior of the whole apparatus a certain regularity and makes it easier to understand.
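The three-way voice-likeness rule can be sketched as below. The source does not say which condition takes precedence when both could apply, so checking the single sustained candidate first is an assumption, as are the `Candidate` fields, the score range, and every threshold value.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    direction_deg: float
    voiceness: float    # hypothetical voice-likeness score in [0, 1]
    sustained_s: float  # seconds the score has stayed above the threshold

def select_direction(candidates, default_dir_deg=0.0, voice_thresh=0.7,
                     time_thresh_s=0.5, crowd_count=3):
    """`candidates` is ordered by the estimator's ranking (first = best)."""
    if not candidates:
        return default_dir_deg
    voiced = [c for c in candidates if c.voiceness > voice_thresh]
    sustained = [c for c in voiced if c.sustained_s > time_thresh_s]
    if len(sustained) == 1:
        # Exactly one steady voice-like source: prefer it over the ranking.
        return sustained[0].direction_deg
    if len(voiced) >= crowd_count:
        # Many voice-like sources: do not guess, use the default direction.
        return default_dir_deg
    return candidates[0].direction_deg

picked = select_direction([
    Candidate(10.0, 0.2, 0.0),   # estimator's first candidate, not voice-like
    Candidate(45.0, 0.9, 1.2),   # steady voice-like source
])
```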

When the result of speaker authentication is used as sound source status information, processing such as the following can be applied to the input. Suppose, for example, that the sound source information processing apparatus is used in a dialogue device. If the estimated sound source is authenticated as the speaker who should currently be speaking, the estimated sound source information is output. If it is judged not to be that speaker, the sound is not output, for example by returning silence. With this processing, the apparatus does not react to anything other than the target sound source, even when ambient noise causes an incorrect sound source estimate.
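A minimal sketch of the speaker gate follows. It assumes that authentication has already produced a speaker label; the function name `gate_by_speaker` and the string labels are hypothetical, and the authentication step itself is outside this example.

```python
def gate_by_speaker(samples, recognized_speaker, expected_speaker):
    """Pass the audio through only when the authenticated speaker is the one
    expected to be talking; otherwise return silence of the same length."""
    if recognized_speaker == expected_speaker:
        return list(samples)
    return [0.0] * len(samples)

block = [0.1, -0.2, 0.3]
accepted = gate_by_speaker(block, "user_a", "user_a")
rejected = gate_by_speaker(block, "unknown", "user_a")
```

Returning silence of equal length, rather than an empty block, keeps the output stream's timing intact for any downstream consumer.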

When the apparatus is used as a system that outputs the sound from the estimated sound source direction, the output volume can also be varied according to the estimated direction. For example, the sound of a source estimated to be directly in front is output at increased volume, and the output volume is reduced as the estimated direction shifts to the side. If the apparatus output only the sound of the estimated source as is, the direction from which the sound came would be lost. With the above arrangement, the volume of the same source changes as the system turns left or right. When the system is applied to a robot, for example, this produces a situation close to human hearing, and a downstream system can perceive information about the sound source direction.
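One possible gain curve for this behavior is sketched below. The source states only that the output is louder toward the front and quieter toward the side; the cosine shape, the convention that 0 degrees is the front, and the `min_gain` value are assumptions.

```python
import math

def directional_gain(direction_deg, min_gain=0.2):
    """Gain of 1.0 at the front (0 degrees), falling to `min_gain` at 90
    degrees or beyond on either side."""
    angle = min(abs(direction_deg), 90.0)
    return min_gain + (1.0 - min_gain) * math.cos(math.radians(angle))

def apply_directional_gain(samples, direction_deg):
    """Scale a block of samples by the gain for its estimated direction."""
    gain = directional_gain(direction_deg)
    return [gain * x for x in samples]

front = apply_directional_gain([1.0, -1.0], 0.0)
side = apply_directional_gain([1.0, -1.0], 60.0)
```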

When the apparatus is used as a system that outputs the sound from the estimated sound source direction, it can also emphasize and output the sound from a particular direction. That is, when the processing means decides to collect and output the sound from a certain direction, it obtains the sound from that direction, for example by applying delay-and-sum processing to the inputs from the plurality of microphones, and outputs it. Suppose, for example, that the sound source estimation means estimates a source in the 10-degree direction and the sound from that direction is about to be output, but the processing means judges from the sound source status information that this output is not appropriate and decides to output the sound from the 90-degree direction instead. In this case, the processing means applies delay-and-sum processing or the like to the microphone signals to emphasize and output the sound from that direction. This allows output suited to the purpose even in situations where the sound source estimation means is prone to estimation errors.
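Delay-and-sum processing can be sketched as follows for a linear array and a far-field source. The geometry conventions are assumptions not stated in the source: microphones lie on the x axis, the angle is measured from broadside toward +x, delays are rounded to whole samples, and the speed of sound is taken as 343 m/s.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def arrival_lags(mic_positions_m, direction_deg, sample_rate):
    """Whole-sample lag of each microphone relative to the earliest arrival.

    A microphone farther along +x is closer to a source at a positive angle,
    so it receives the wavefront earlier (smaller lag).
    """
    theta = math.radians(direction_deg)
    raw = [-x * math.sin(theta) / SPEED_OF_SOUND * sample_rate
           for x in mic_positions_m]
    base = min(raw)
    return [round(r - base) for r in raw]

def delay_and_sum(channels, lags):
    """Advance each channel by its lag so the wavefronts line up, then average."""
    n = min(len(ch) - lag for ch, lag in zip(channels, lags))
    return [
        sum(ch[i + lag] for ch, lag in zip(channels, lags)) / len(channels)
        for i in range(n)
    ]

# Two microphones 0.343 m apart at 1 kHz sampling: a source at 90 degrees
# reaches the second microphone exactly one sample before the first.
lags = arrival_lags([0.0, 0.343], 90.0, 1000)
mic0 = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # pulse arrives one sample later
mic1 = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
steered = delay_and_sum([mic0, mic1], lags)
```

Steering toward the true direction aligns the two pulses so they add coherently, while steering elsewhere would leave them misaligned and halve the peak.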

(First embodiment)
Next, the sound source information processing apparatus, the sound source information processing method, and the sound source information processing program according to the present invention will be described in more detail based on specific embodiments.

FIG. 1 is a block diagram showing a configuration example of a sound source information processing system to which the present invention is applied. The sound source information processing system 100 shown in FIG. 1 includes N microphones (N ≥ 2) 101-1 to 101-N, an input processing unit 102, and a signal processing unit 103 that has a microphone array processing unit 104 and a sound source information processing unit 105. The sound source information processing unit 105 has sound source information extraction means (function) 108, sound source status information acquisition means (function) 109, and processing means (function) 110. The sound source information extraction means 108 extracts, from the sound sources estimated by the microphone array processing unit 104, the output information of the sound source information processing system 100, that is, the sound source information that the system is intended to output. The sound source status information acquisition means 109 acquires sound source status information for selecting the processing to be applied to the extracted sound source information. The processing means 110 applies predetermined processing to the extracted sound source information based on the sound source status information.

In the sound source information processing system 100 configured in this way, sound is first collected by the N microphones 101-1 to 101-N and input to the input processing unit 102. The microphones are not particularly limited; any commonly used microphone can be used as long as it collects sound reliably. The N microphones may all be of the same type. Different types of microphones can also be used, provided that their frequency characteristics and the like are measured so that the differences in sound collection characteristics can be absorbed by the subsequent signal processing.

The sounds collected by the microphones 101-1 to 101-N are converted in the input processing unit 102, by an amplifier circuit, an A/D converter, and the like, into a format suitable for digital signal processing. The configuration of the input processing unit 102 is not particularly limited as long as it can convert the sound collected by the microphones into such a form; it can be realized by various means, such as existing electronic circuits or a microphone array input system. Signals between the microphones 101-1 to 101-N and the input processing unit 102 may be transmitted by wire or wirelessly. This too is not particularly limited as long as the signals can be delivered to the input processing unit 102, and it can be realized with existing signal transmission techniques.

The audio signal converted by the input processing unit 102 into a format suitable for digital signal processing is input to the signal processing unit 103, which performs the following signal processing. First, the microphone array processing unit 104 performs microphone array processing. Depending on the intended use of the sound source information processing system 100, microphone array processing can localize a sound source, further emphasize the sound in the localized direction, or emphasize the sound in a specified direction while suppressing sound from other directions. In other words, the microphone array processing unit 104 performs signal processing to obtain the feature quantities required by the sound source information processing system 100, and in particular processing to obtain sound source information from the audio signals collected by the plurality of microphones.

On receiving the information from the microphone array processing unit 104, the sound source information processing unit 105 performs three functions. Its sound source information extraction function extracts, from the sound source information supplied by the microphone array processing unit 104, the sound source information on the target sound source to be output from the sound source information processing system. Its sound source status information acquisition function acquires sound source status information used to select the processing to be applied to the extracted sound source information. Its processing function applies predetermined processing to the extracted sound source information on the basis of the sound source status information. The unit then produces an output 106 in a predetermined format suited to the destination of the sound source information processing system's output.

The signal processing unit 103, which includes the microphone array processing unit 104 and the sound source information processing unit 105, is not particularly limited as long as it can perform the processing described above. It can be realized with existing signal processing chips, electronic circuits, and the like, or in a form in which the digitized signals undergo digital signal processing inside a computer. It can also be realized as processes running on a computer.

The input processing unit 102, the signal processing unit 103, the microphone array processing unit 104, and the sound source information processing unit 105 described above are named individually according to their functions in the present invention. In an actual implementation they may be program processes, or they may be distributed across a network in a manner that allows them to exchange information. To coordinate the individually named processing units, the overall processing may be managed by a system at a higher level than the sound source information processing system. In this specification, the signal processing unit 103 of the sound source information processing system 100, which comprises the microphone array processing unit 104 and the sound source information processing unit 105, is referred to as the sound source information processing apparatus.

Next, a specific embodiment in which the sound source information processing system 100 described above is applied to an automobile 201 will be described. FIG. 2 is a diagram illustrating an example in which a sound source information processing system to which the present invention is applied is used inside an automobile. A driver's seat user 202 and a passenger seat user 203 are seated in the driver's seat 202a and the passenger seat 203a of the automobile 201, respectively. A sound source information processing system 204 with integrated microphones is arranged on the ceiling inside the automobile 201. The sound source information processing system 204 exchanges information with an in-vehicle information system 205 to perform predetermined processing; this exchange may be wireless or wired. The in-vehicle information system 205 can also obtain various kinds of information from the operation system 206 of the automobile, so that it can determine the driving situation of the automobile 201 and can pass this driving situation information on to the sound source information processing system 204. The sound source information processing system 204 has the same configuration as the sound source information processing system 100 shown in FIG. 1 and described above.

Here it is assumed that the users 202 and 203 and the in-vehicle information system 205 can exchange information by voice via the sound source information processing system 204. Voice commands from the users 202 and 203 may be affected by ambient noise or by utterances from speakers other than the one currently interacting with the in-vehicle information system 205. If speech affected in this way is passed on unchanged, the in-vehicle information system 205 may not only fail to achieve sufficient performance but may also behave unexpectedly and, as a result, become entirely useless. In other words, depending on the surrounding environment, the in-vehicle information system 205 may be unable to function normally.

If the sound source information processing system 204 can accurately determine where the current voice command is being uttered from, it can collect the utterance of the appropriate speaker and send the result to the in-vehicle information system 205. Such sound source information and its use are described later with examples.

The flow of processing in the sound source information processing system 204 will now be described with reference to the flowchart shown in FIG. 3. First, when a dialogue between the users 202 and 203 and the in-vehicle information system 205 starts (S301), the sound source information processing system 204 collects the sound inside the vehicle with a predetermined number N (two or more) of microphones (S302). The collected audio signals are input to the input processing unit 102, which takes in the audio signals from the microphones and converts them into a format suitable for digital signal processing (S303). As for the amplification of the received signals in the input processing unit 102, it is assumed that the microphones were calibrated when the sound source information processing system 204 was set up, so that every microphone yields about the same level for a given sound volume arriving from a given direction. Alternatively, the system or the user may be allowed to calibrate at will through a predetermined operation. Because the calibration method itself is not directly related to the present invention, its details are omitted.

Next, the audio signal converted by the input processing unit 102 into a format suitable for digital signal processing is input to the signal processing unit 103. From the audio signal input in this way, the microphone array processing unit 104 estimates the sound source in line with the purpose of the sound source information processing system 204 (S304).

The microphone array processing unit 104 can localize the position of a sound source or collect sound from a target direction on the basis of the phase differences, levels, and so on of the received signals. As explained in the embodiments below, the present invention does not depend on how the signal processing in the microphone array processing unit 104, that is, the microphone array processing, is implemented, as long as sound source information suited to the purpose can be obtained. It is assumed to be realizable by techniques such as the delay-and-sum array or the adaptive array described in the known literature cited as Non-Patent Document 1; the details are omitted here. Likewise, the arrangement and orientation of the N microphones need only suit the purpose. The description here assumes two microphones and microphone array processing according to Patent Document 2. In this case, two adaptive beamformers are configured for the two microphones and control each other so that one suppresses the target sound source and the other suppresses the noise source.
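The delay-and-sum array named above as one possible realization can be sketched as follows. This is a minimal illustration in plain Python, not the two-beamformer adaptive method of Patent Document 2; the function name `delay_and_sum` and the use of integer sample delays are assumptions made for the example.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by an integer sample delay and average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   arrival delay (in samples) of the look direction at each
              microphone, relative to the earliest microphone.
    Both the name and the integer-delay simplification are illustrative.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for samples, delay in zip(channels, delays):
            idx = t + delay  # read ahead on channels that hear the source later
            if 0 <= idx < n:
                acc += samples[idx]
        out.append(acc / len(channels))
    return out


# A pulse that reaches microphone 0 at sample 3 and microphone 1 at sample 5.
mic0 = [0, 0, 0, 1, 0, 0, 0, 0]
mic1 = [0, 0, 0, 0, 0, 1, 0, 0]
steered = delay_and_sum([mic0, mic1], [0, 2])    # delays match the source
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # delays do not match
```

When the delays match the source direction the pulse adds coherently; otherwise its energy is spread over two samples at half amplitude, which is the basic mechanism behind emphasizing one direction.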

The sound source information estimated by the microphone array processing unit 104 is then input to the sound source information processing unit 105. From the sound sources estimated by the microphone array processing unit 104, the sound source information processing unit 105 extracts the sound source information of the target sound source to be output from the sound source information processing system 100, and acquires sound source status information used to select the processing to be applied to the extracted sound source information. Various kinds of information can serve as this sound source status information, such as the information on the sound source estimated by the microphone array processing unit 104, information on other sound sources, or information other than sound source information. On the basis of the sound source status information, the unit determines whether predetermined processing needs to be applied to the estimated sound source information and, if so, what processing to apply. The details of this determination and of the processing are described later.

If, as a result of this determination, it is judged that predetermined processing needs to be applied to the extracted sound source information (Yes in S305), the sound source information processing unit 105 processes the sound source information according to a predetermined rule (S306). The processed sound source information is then converted into a predetermined output format as the final output information of the sound source information processing system 204 (S307) and output to the in-vehicle information system 205 (S308), which ends the series of processing. The output format of the sound source information processing system 204 can be changed as appropriate for the purpose.

If, on the other hand, it is judged that no predetermined processing needs to be applied to the extracted sound source information (No in S305), the extracted sound source information is converted as is into the predetermined output format as the final output information of the sound source information processing system 204 (S307) and output to the in-vehicle information system 205 (S308), which ends the series of processing.
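The branch at S305 and the common tail S307 to S308 can be summarized in a few lines. The sketch below is only an illustration of the control flow described above; `run_cycle` and its callback parameters are hypothetical names, and the concrete judgment, processing, and output format are left to the caller because the specification leaves them open.

```python
def run_cycle(extracted, needs_processing, process, to_output_format, send):
    """One pass through S305 to S308 for already extracted sound source info.

    All five parameters are hypothetical stand-ins for this sketch:
    needs_processing(info) -> bool   corresponds to the S305 judgment,
    process(info)                    to the rule-based processing of S306,
    to_output_format(info)           to the format conversion of S307,
    send(payload)                    to the output to system 205 in S308.
    """
    if needs_processing(extracted):      # S305
        info = process(extracted)        # S306
    else:
        info = extracted                 # passed through unchanged
    payload = to_output_format(info)     # S307
    send(payload)                        # S308
    return payload


sent = []
result = run_cycle(
    {"direction_deg": 40.0},
    needs_processing=lambda info: info["direction_deg"] > 30.0,
    process=lambda info: {"direction_deg": 30.0},
    to_output_format=lambda info: "dir=%.1f" % info["direction_deg"],
    send=sent.append,
)
```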

In the description of this embodiment, the output of the sound source information processing system 204 is the sound from the direction estimated to be that of the target sound source. The processing above can be carried out at fixed intervals or at arbitrary intervals. The output in the embodiment described here is continuous: the sound source is estimated for each unit of time, and during that unit of time the sound from the target sound source direction is collected, processed, and delivered as the output of the sound source information processing system 204.

Next, the determination of whether predetermined processing needs to be applied to the extracted sound source information, and the processing itself, will be described. The following explanation, with reference to FIGS. 4 and 5, takes the output of the microphone array processing unit 104 to be the result of localizing a user's utterance inside the vehicle and collecting sound with the directivity pointed in that direction. Of the two beamformers, one suppresses the noise direction and collects the target speech, while the other suppresses the target sound source direction and collects the noise. The strength of the noise, that is, of everything other than the target sound source, is measured and compared with a predetermined threshold to judge whether the ambient noise is strong, and this in turn decides whether predetermined processing is applied to the extracted sound source information (the information on the target sound source). FIG. 4 is a table of procedure information that associates judgment conditions with the processing to be applied to the sound source information when the noise strength is used as the sound source status information. This table is held, for example, in the information holding means 107 in the sound source information processing unit 105, which reads the information as needed to perform the predetermined processing. FIG. 5 is a flowchart showing the determination of whether predetermined processing needs to be applied to the extracted sound source information, and the flow of that processing.

First, the sound source information processing unit 105 acquires the strength of the noise other than the target sound source as the sound source status information for deciding whether predetermined processing is needed. It then determines whether the noise strength is below a predetermined threshold (S501). If the noise strength is judged to be below the threshold (Yes in S501), the extracted sound source information is regarded as an utterance from a user, the predetermined processing is judged unnecessary (S502), and the extracted sound source information becomes the final output of the sound source information processing system 204 as is. The extracted sound source information is therefore converted directly into the predetermined output format (S307) and output to the in-vehicle information system 205 (S308), which ends the series of processing. As a result, the in-vehicle information system that receives this output can pick up a user's utterance from any position inside the automobile and put it to use, for example for speech recognition or hands-free telephony.

If, on the other hand, the noise strength is judged to be above the predetermined threshold (No in S501), the ambient noise is strong and the accuracy of estimates such as the localization of the target speech degrades. In an automobile in particular, using voice information is expected to be difficult when driving at high speed or in bad weather. If the estimated target sound source information is output unchanged in such a case, the sound source information processing system may emit an erroneous estimate, which can adversely affect processing in the downstream in-vehicle information system 205. For example, the sound of an oncoming vehicle passing at high speed might be mistaken for a user's utterance and sent to the in-vehicle information system, which would then run speech recognition on it and interpret a command the user never intended.

Accordingly, if the noise strength is judged to be above the predetermined threshold (No in S501), predetermined processing of the extracted sound source information is judged necessary (S503). The estimated sound source information of the target speech is ignored, the target-speech beamformer is pointed only at the driver's seat, and only the sound from the driver's seat direction is collected and extracted (S504). The sound source information processed in this way is converted into the predetermined output format as the final output information of the sound source information processing system 204 (S307) and output to the in-vehicle information system 205 (S308), which ends the series of processing.
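The decision in S501 to S504 amounts to a threshold test that either keeps the estimated direction or overrides it with the driver's seat direction. The following is a minimal sketch under stated assumptions: the constants `NOISE_THRESHOLD` and `DRIVER_SEAT_DIRECTION_DEG`, the function name, and the choice to treat a noise level exactly equal to the threshold as noisy are all illustrative and not taken from the specification.

```python
NOISE_THRESHOLD = 0.5              # hypothetical value and unit
DRIVER_SEAT_DIRECTION_DEG = -30.0  # hypothetical direction of the driver's seat


def select_beam_direction(estimated_direction_deg, noise_level,
                          threshold=NOISE_THRESHOLD,
                          driver_direction_deg=DRIVER_SEAT_DIRECTION_DEG):
    """Return (direction to steer the target beamformer, processing applied?).

    Below the threshold (Yes in S501) the estimate is trusted as is (S502).
    Otherwise the estimate is ignored and the beam is fixed on the
    driver's seat (S503, S504). Equality is treated as noisy here, which
    is an assumption; the text only covers "lower" and "higher".
    """
    if noise_level < threshold:
        return estimated_direction_deg, False
    return driver_direction_deg, True


quiet = select_beam_direction(25.0, noise_level=0.1)
noisy = select_beam_direction(25.0, noise_level=0.9)
```

In a fuller version the pair of condition and action could be read from a table such as the one in FIG. 4 rather than hard-coded.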

This processing greatly reduces the likelihood that noise from directions other than the driver's seat will cause the in-vehicle information system 205 to malfunction. A user's voice may of course come from somewhere other than the driver's seat, but in noisy situations such as the high-speed driving mentioned above, voice input from other seats is unlikely to be needed. More importantly, applying the predetermined processing to the sound source information in noisy conditions reduces the chance of the in-vehicle information system malfunctioning because of a sound source estimation error, and ensures that at least the voice from the driver's seat is collected reliably. Applying predetermined processing to the sound source information to be output in this way makes the processing better matched to the surrounding situation and more practical.

When the sound source information processing system 204 outputs the sound from the estimated sound source direction (or the sound from a specific direction after predetermined processing), the delay introduced by the processing may have to be taken into account. Suppose, for example, that sound arrives at time 0, that estimating the sound source takes 1 second, and that extracting and judging the sound source information takes 2 seconds. If the decision is to emphasize and output the sound from a certain direction, it can only take effect on the sound arriving 3 seconds (1 second + 2 seconds) later. In such a case, a ring buffer is provided in the signal processing unit 103 so that at least 3 seconds of audio from the N microphones can be stored in a suitable format. Processing the stored audio then yields an output in accordance with the present invention.
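A ring buffer sized to the processing delay can be sketched with `collections.deque`, whose `maxlen` argument discards the oldest entry automatically. The class name `FrameRingBuffer`, the frame-based layout, and the 16 kHz sample rate with 160-sample frames in the usage lines are assumptions for illustration; only the 1 second + 2 seconds figure comes from the example above.

```python
from collections import deque


class FrameRingBuffer:
    """Keeps the most recent multichannel frames covering `latency_s` seconds.

    Each frame is expected to hold `frame_len` samples for each of the
    N microphones; the buffer itself does not inspect the frame contents.
    """

    def __init__(self, sample_rate_hz, frame_len, latency_s):
        samples_needed = int(sample_rate_hz * latency_s)
        # Round up so the buffer never holds less than the required span.
        frames_needed = -(-samples_needed // frame_len)
        self._frames = deque(maxlen=frames_needed)

    def push(self, frame):
        self._frames.append(frame)  # the oldest frame drops out when full

    def oldest(self):
        return self._frames[0]

    def __len__(self):
        return len(self._frames)


# 1 s for estimation plus 2 s for extraction and judgment, as in the text.
buf = FrameRingBuffer(sample_rate_hz=16000, frame_len=160, latency_s=1 + 2)
for frame_index in range(301):
    buf.push(frame_index)  # a real frame would carry N channels of samples
```

Once the decision for time 0 is available, the audio from time 0 is still in the buffer and can be processed retroactively.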

FIG. 6 is a diagram showing the hardware configuration of the sound source information processing system 204 according to the first embodiment. As its hardware configuration, the sound source information processing system 204 comprises: a ROM 252 that stores, among others, a microphone array processing program for executing the microphone array processing and programs for executing the sound source information extraction function, the sound source status information acquisition function, and the processing function of the sound source information processing unit 105 described above; a CPU 251 that controls each part of the sound source information processing system 204 according to the programs in the ROM 252 and executes processing such as changing the buffering time; a RAM 253 in which a work area is formed and which stores the various data needed to control the sound source information processing system 204; a communication I/F 257 that connects to a network for communication; a microphone 258 that collects audio signals from sound sources; and a bus 262 that connects these parts.

The programs of the sound source information processing system 204 mentioned above, such as the microphone array processing program, may be provided as files in an installable or executable format recorded on a computer-readable recording medium such as a CD-ROM, a magnetic disk such as a floppy (registered trademark) disk (FD), or a DVD.

In this case, the programs such as the microphone array processing program are read from the recording medium and executed in the sound source information processing system 204, whereby they are loaded onto the main storage device and the units described above are generated on the main storage device.

The programs of this embodiment, such as the microphone array processing program, may also be stored on a computer connected to a network such as the Internet and provided by download over the network.

The present invention has been described above using the first embodiment, but various changes or improvements can be made to that embodiment.

(Second Embodiment)
Next, a second embodiment of the sound source information processing system to which the present invention is applied will be described. For ease of understanding, the configuration of the sound source information processing system of this embodiment and its processing flow are the same as in the first embodiment; the description above applies, and a detailed description is omitted here. Only the differences from the first embodiment are described below.

The second embodiment differs from the first in that a history of past sound source information is used as the sound source status information described above. That is, in the second embodiment the sound source information processing unit 105 shown in FIG. 1 has a storage area capable of storing past sound source information, and the past sound source information recorded there is used to apply predetermined processing to the extracted sound source information. In the following, the information holding means 107 described above is used as this storage area.

As the sound source information processing system according to this embodiment proceeds through the flow shown in FIG. 3, it extracts sound source information and stores it in the information holding means 107 for a fixed number of entries or a fixed span of time. In this embodiment, as one example, the estimated direction of the target sound source is the item of sound source information that is stored. FIG. 7 is a characteristic diagram showing the history of estimated target sound source directions stored in the information holding means 107. In FIG. 7, the history of the sound source direction estimated at each time (the stars in FIG. 7) is stored for a fixed span from a point in the past up to the present. Using this history of estimated target sound source directions as the sound source status information, predetermined processing can be applied to the extracted sound source information at fixed intervals or at arbitrary timing.

In the sound source information processing system 204 according to this embodiment, a function is fitted to the stored past sound source directions, as shown by the solid line in FIG. 7, and this function is used to determine whether predetermined processing needs to be applied to the sound source information. Specifically, if the estimated direction (angle) of the target sound source lies within a fixed range of the value given by the approximating function, the estimated direction (angle) is used as is as the final output information of the sound source information processing system 204.

If, on the other hand, the estimated direction (angle) of the target sound source deviates from the value given by the approximating function by a fixed amount or more, the value of the approximating function is used as the target sound source direction for the final output, and the sound from that direction is collected by beamforming and output. Alternatively, the sound from the direction given by the approximating function may be collected and output every time.

For the function approximation, one can use a moving average, a local fit of a linear or polynomial function, or an estimate derived from an assumed model of the sound source's movement. Using function approximation in this way keeps the estimated direction of the target sound source from being greatly perturbed by disturbances and the like. That is, when an estimate is anomalous compared with past estimates, it is not used as is; instead, an output that takes the past history into account can be produced. Using past sound source information as the sound source status information to apply predetermined processing to the sound source information to be output therefore makes the processing better matched to the surrounding situation and more practical.
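Of the approximations listed, the moving average is the simplest to illustrate. The sketch below keeps a short history of estimated directions, predicts the next direction as their mean, and substitutes the prediction when a new estimate deviates by a fixed amount or more. The class name `DirectionTracker`, the history length, and the deviation limit are hypothetical, and angle wraparound (for example 359° versus 1°) is deliberately ignored.

```python
from collections import deque


class DirectionTracker:
    """Gates direction estimates against a moving average of past estimates."""

    def __init__(self, history_len=5, max_deviation_deg=20.0):
        self._history = deque(maxlen=history_len)
        self._max_deviation_deg = max_deviation_deg

    def update(self, estimated_deg):
        """Store the new estimate and return the direction to output."""
        output = estimated_deg
        if self._history:
            predicted = sum(self._history) / len(self._history)
            if abs(estimated_deg - predicted) >= self._max_deviation_deg:
                output = predicted  # anomalous estimate: use the approximation
        # The raw estimate is stored, mirroring the stored history of FIG. 7.
        self._history.append(estimated_deg)
        return output


tracker = DirectionTracker(history_len=3, max_deviation_deg=20.0)
outputs = [tracker.update(d) for d in (10.0, 12.0, 11.0, 90.0)]
```

One design choice to note: because the raw estimate is stored even when rejected, a genuine and sustained change of direction will eventually pull the average along, at the cost of a single outlier briefly biasing the prediction.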

(Third Embodiment)
Next, a third embodiment of the present invention will be described. For ease of understanding, the configuration of the sound source information processing system of this embodiment and its processing flow are assumed to be the same as in the first embodiment; the above description is referred to and a detailed description is omitted here. Only the differences from the first embodiment are therefore described below.

The third embodiment differs from the first embodiment in that an information input device other than a microphone is used. That is, in the configuration shown in FIG. 1, one or more information input devices are used in addition to the microphones. In this embodiment, in the arrangement shown in FIG. 2, each seat is assumed to be equipped with an occupant sensor for determining whether a user is present in that seat. As a method of determining whether a user is present in each seat, a pressure sensor 202b that senses that the user 202 has sat in the driver's seat 202a can be used, as shown for example in FIG. 8. Alternatively, the vehicle interior may be photographed with a camera and image recognition used to determine whether a user is present in each seat. The method of determining seat occupancy is not particularly limited, and various methods can be used. In this way, information other than the sound input collected by the microphones, here information on whether a user is present in each seat of the automobile, can be obtained.

In the following, the processing that uses the occupant sensor 202b, within the flow shown in FIG. 3, to determine whether predetermined processing needs to be applied to the sound source information is described with reference to the flowchart shown in FIG. 9.

First, when a dialogue between the users 202 and 203 and the in-vehicle information system 205 is started (S901), information on the target sound source, in particular its estimated direction, is obtained as in the first embodiment (S902). Next, it is determined whether information on whether a user is present in each seat has been obtained from the occupant sensor 202b (S903). If that information has not been obtained (No in S903), the sound source information processing system 204 outputs a result indicating that there was no utterance from a user, even if estimated sound source information (an estimated direction) was obtained in step S902 (S905). The order of steps S902 and S903 may also be reversed, so that the sound source information of the target sound is obtained only after it has been determined that information from the occupant sensor 202b is available.

When information on whether a user is present in each seat has been obtained from the occupant sensor 202b (Yes in S903), that is, when both the estimated target sound source information (estimated direction) and the information from the occupant sensor 202b have been obtained, the range of sound source angles permitted for the seat of each occupant sensor that reacted is compared with the estimated direction (S904), and it is determined whether the estimated direction of the target sound source corresponds to a seat whose occupant sensor 202b reacted (S906). If it does not (No in S906), it is judged that the estimate was erroneous or that the sound was not a user's voice, and, as above, sound source information is output indicating that there was no utterance from a user (S905).

On the other hand, when the estimated direction of the target sound source corresponds to a seat whose occupant sensor 202b reacted (Yes in S906), it is judged that the utterance came from the user in that seat, sound is collected with a beam formed in that direction, and the resulting sound source information is output (S907).
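The flow of steps S903 to S907 can be sketched as below. The sketch rests on several assumptions: the function name, the seat labels, and the representation of sensor output as a set of seat names are hypothetical, and the angle ranges in `SEAT_RANGES` are borrowed from the fifth embodiment purely for illustration. Beamforming itself is not modeled; the function only reports which seat and direction would be used.

```python
# Hypothetical permitted direction range (degrees) for each seat.
SEAT_RANGES = {
    "driver": (15.0, 45.0),
    "passenger": (-45.0, -15.0),
}


def process_with_seat_sensors(estimated_deg, occupied_seats, seat_ranges=SEAT_RANGES):
    """Return (seat, direction) to beamform toward, or None for "no utterance".

    estimated_deg  -- estimated direction of the target sound source (S902)
    occupied_seats -- seats whose occupant sensor reacted; None or an empty
                      set is treated here as "no sensor information" (S903)
    """
    if not occupied_seats:
        return None                       # S903 negative -> S905
    for seat in sorted(occupied_seats):   # S904: compare with each seat's range
        low, high = seat_ranges[seat]
        if low <= estimated_deg <= high:  # S906 affirmative
            return seat, estimated_deg    # S907: collect sound in this direction
    return None                           # S906 negative -> S905


print(process_with_seat_sensors(30.0, {"driver"}))
print(process_with_seat_sensors(-30.0, {"driver"}))
```

Returning `None` for both the missing-sensor case and the mismatched-direction case mirrors the text, where both branches lead to the same output in S905.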

In the sound source information processing system 204 according to this embodiment, using input from an information input device other than a microphone as described above makes it possible to apply predetermined processing to the sound source information obtained by microphone array processing and to output appropriate information. As a result, even when disturbances or the like cause an error in the microphone array processing result, an abnormal output that would adversely affect processing in the downstream in-vehicle information system 205 can be prevented. Therefore, by applying predetermined processing to the sound source information to be output as described above, processing that better matches the surrounding situation and is more practical can be performed.

(Fourth Embodiment)
Next, a fourth embodiment of the present invention will be described. For ease of understanding, the configuration of the sound source information processing system of this embodiment and its processing flow are assumed to be the same as in the first embodiment; the above description is referred to and a detailed description is omitted here. Only the differences from the first embodiment are therefore described below.

The fourth embodiment differs from the first embodiment in that the content of the sound source information of the estimated target sound source, of a sound source corresponding to noise other than the target (a non-target sound source), or of both is estimated, and predetermined processing is applied to the extracted sound source information on the basis of that content. For ease of understanding, an example of estimating from the content of both is described here with reference to FIGS. 10 and 11. FIG. 10 is a table summarizing procedure information that associates judgment conditions with the processing to be applied to the sound source information, for the case where the content of the sound source information of the target and non-target sound sources is used as the sound source status information. This table is held, for example, in the information holding means 107 described above and is read out as needed so that the predetermined processing can be performed. FIG. 11 is a flowchart showing the determination of whether predetermined processing needs to be applied to the extracted sound source information, and the flow of that processing.

Assume that, following the processing flow of FIG. 3, microphone array processing has yielded the sound source information of the target sound source and the sound source information obtained from a beamformer that places a null (blind spot) in the direction of the target sound source. Here, "placing a null in that direction" means collecting sound with the directivity removed for that specific direction, so that sound from directions other than that specific direction is collected. In this way, both the sound source information of the target sound and the sound source information of the non-target sound source corresponding to noise are obtained.

The content of the target sound and of the non-target sound source (noise) is then estimated from these two pieces of sound source information. Because the target sound here is the utterance of a user in the automobile, that is, a human voice, the content of the collected sound source information is estimated by focusing on whether the collected sound is a human voice. This estimation can be realized with existing techniques, for example by using the zero crossings or the spectrum of the speech waveform. Since the determination method itself is not directly related to the present invention, a detailed description is omitted here.

For the non-target sound source (noise), on the other hand, the content is estimated by focusing on how the noise occurs, that is, whether the non-target sound source (noise) is stationary or non-stationary. This estimation can also be realized with existing techniques, for example by using the change in volume over time. Since the method of determining whether noise is stationary is not directly related to the present invention, its details are omitted.
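The specification leaves both determination methods to existing techniques. Purely to illustrate the two cues it names, the following sketch computes a zero-crossing rate and a crude stationarity check based on the spread of per-frame volume levels. The function names, the use of a max-minus-min spread, and the 3.0 threshold are assumptions made for this example; a practical voice detector or noise classifier would need considerably more than this.

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / (len(samples) - 1)


def is_stationary(frame_levels, max_spread=3.0):
    """Treat noise as stationary when its per-frame level barely changes.

    frame_levels -- volume level of each frame (for example in dB)
    max_spread   -- hypothetical threshold on the max-minus-min level
    """
    return max(frame_levels) - min(frame_levels) <= max_spread


print(zero_crossing_rate([1.0, -1.0, 1.0, -1.0]))
print(is_stationary([60.0, 61.0, 60.5]))
print(is_stationary([60.0, 75.0, 61.0]))
```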

Once the content of the target sound and of the non-target sound source (noise) has been estimated as described above, the predetermined processing described below is applied to the extracted sound source information on the basis of that content.

First, the content of the target sound source is estimated (S1101), and it is determined whether the estimated target sound source is a human utterance (S1102). If it is (Yes in S1102), the content of the non-target sound source is then estimated, and it is determined whether the estimated non-target sound source is stationary noise (S1103). If the non-target sound source is stationary noise (Yes in S1103), the situation is favorable for sound source information extraction and an estimation error is unlikely. In this case, therefore, the sound source information is used as estimated, sound is collected from the seat direction only (S1104), and the result is treated as the output of the sound source information processing system 204.

On the other hand, if the estimated non-target sound source is determined not to be stationary noise (No in S1103), the situation is one in which estimation errors are likely to occur in sound source information extraction. For example, sudden noise may be estimated to be a human utterance while the human utterance is estimated to be non-stationary sound. In such a case, to prevent malfunction in the downstream stage, the originally estimated direction of the target sound source is corrected and sound is collected from the driver's seat direction only (S1105), and the result is treated as the output of the sound source information processing system 204. This processing effectively reduces the possibility that sudden noise is mistaken for a human utterance and output from the sound source information processing system.

When the content of the target sound source is estimated (S1101) and the estimated target sound source is determined not to be a human utterance (No in S1102), the content of the non-target sound source is likewise estimated next, and it is determined whether the estimated non-target sound source is stationary noise (S1106). If it is stationary noise (Yes in S1106), the output of the sound source information processing system explicitly indicates that it is stationary noise (S1107). The downstream system can then estimate the nature of the noise from the stationary noise obtained and apply a commonly used noise suppression method, improving the accuracy of its own processing such as speech recognition and speech enhancement.

On the other hand, if the estimated non-target sound source is determined not to be stationary noise (No in S1106), two situations are possible. In the first, there was actually no user utterance and non-stationary noise was present. In the second, sound source localization was in error, so that noise was estimated to be the target sound source and the user's utterance was estimated to be non-stationary noise. In this case, therefore, the estimated content is checked (S1108). That is, the estimated directions of the target and non-target sound sources are swapped, the original non-target sound source direction is regarded as the target sound source direction, and it is determined again whether the sound from that direction is a human utterance (S1109).

If the non-target sound source is estimated to be a human utterance in this determination (Yes in S1109), the sound from the non-target sound source direction is regarded as the user's utterance and is treated as the output of the sound source information processing system 204 (S1110). If it is still not estimated to be a human utterance (No in S1109), it is treated as there having been no utterance from a user (S1111).
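The branching of steps S1102 to S1111 can be summarized in code form as follows. The enum members and the three boolean inputs are names invented for this sketch; in the text the corresponding judgments come from the content estimation described earlier, and the association of conditions with processing is held in the table of FIG. 10 rather than hard-coded.

```python
from enum import Enum


class Output(Enum):
    SEAT_DIRECTION = "collect sound from the estimated seat direction"  # S1104
    DRIVER_SEAT_ONLY = "collect sound from the driver's seat only"      # S1105
    STATIONARY_NOISE = "report stationary noise"                        # S1107
    SWAPPED_DIRECTION = "use the non-target direction as the utterance" # S1110
    NO_UTTERANCE = "no utterance from a user"                           # S1111


def decide(target_is_voice, nontarget_is_stationary, nontarget_is_voice=False):
    """Map the content estimates to the processing described in the text.

    nontarget_is_voice is only consulted on the re-check path (S1108, S1109).
    """
    if target_is_voice:                    # S1102
        if nontarget_is_stationary:        # S1103
            return Output.SEAT_DIRECTION   # S1104
        return Output.DRIVER_SEAT_ONLY     # S1105
    if nontarget_is_stationary:            # S1106
        return Output.STATIONARY_NOISE     # S1107
    # S1108: swap target and non-target directions and test again (S1109).
    if nontarget_is_voice:
        return Output.SWAPPED_DIRECTION    # S1110
    return Output.NO_UTTERANCE             # S1111


print(decide(True, False).value)
print(decide(False, False, nontarget_is_voice=True).value)
```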

In the sound source information processing system 204 according to this embodiment, estimating the content of the sound sources as described above and using it as sound source status information makes it possible to apply predetermined processing to the sound source information obtained by microphone array processing and to output appropriate information. Therefore, by applying predetermined processing to the sound source information to be output as described above, processing that better matches the surrounding situation and is more practical can be performed.

(Fifth Embodiment)
Next, a fifth embodiment of the present invention will be described. For ease of understanding, the configuration of the sound source information processing system of this embodiment and its processing flow are assumed to be the same as in the first embodiment; the above description is referred to and a detailed description is omitted here. Only the differences from the first embodiment are therefore described below.

The fifth embodiment differs from the first embodiment in that the certainty of the estimated sound source information is used as the sound source status information described above. In this embodiment, the output from the sound source information processing system 204 broadly carries one of three meanings: where in the car a user's utterance came from, together with its direction; that there was no user utterance (no useful information); or a request for confirmation by the user via the in-vehicle information system 205. This enables an anthropomorphic agent in the in-vehicle information system 205, when interacting with users, to respond appropriately to the user who spoke.

When the sound source information processing system 204 estimates that a user has spoken in the vehicle, it determines the direction of the sound source. FIG. 12 shows how sound source directions in the vehicle are defined in this case. An automobile with a right-hand steering wheel 215 is described here.

Taking as 0 degrees the direction of an imaginary line that passes through the center of the automobile in its width direction and extends toward the rear of the automobile, the driver's seat side is defined as the positive direction and the passenger seat side as the negative direction. Sound source directions in the range of +15 to +45 degrees correspond to the driver's seat 202a, and those in the range of −15 to −45 degrees correspond to the passenger seat 203a. Sound source directions in the range of −15 to +15 degrees correspond to the rear seat 220. The conversion from sound source direction information to seat information can be performed by the sound source information processing unit 105, using its sound source status information acquisition function, by taking the direction estimated by the microphone array processing unit 104 as input and referring to a table of predetermined mapping information (not shown) stored in the information holding means 107. The sound source information processing unit 105 can then use the converted information as sound source status information in its processing function. The mapping information here is procedure information that enables the conversion from sound source direction to seat, such as which range of angles corresponds to the driver's seat and which range corresponds to the passenger seat.

This mapping information can be rewritten from outside the sound source information processing system 204. Making the mapping information rewritable from outside allows the processing to be applied to the sound source information to be judged correctly even when the surrounding environment changes. For example, when a sound source information processing system 204 that has been used in a right-hand-drive automobile is moved to a left-hand-drive automobile, processing suited to the purpose can be performed simply by rewriting the mapping information for the angles, without rewriting the processing applied to the sound source information. A sound source information processing system 204 that is convenient when the usage environment changes can thus be realized.
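A rewritable mapping of this kind might look like the following sketch. The table layout, the seat labels, and the function name `seat_for` are assumptions; the right-hand-drive ranges are the ones given in the text, while the left-hand-drive table is a hypothetical example that simply mirrors them. Boundary angles such as exactly +15 degrees belong to two ranges in the text, so here they fall to whichever row is listed first.

```python
# Each row: (low_deg, high_deg, seat). Ranges for a right-hand-drive car
# as given in the text.
RIGHT_HAND_DRIVE = [
    (15.0, 45.0, "driver"),
    (-45.0, -15.0, "passenger"),
    (-15.0, 15.0, "rear"),
]

# Hypothetical replacement table for a left-hand-drive car: only the data
# changes, not the lookup code.
LEFT_HAND_DRIVE = [
    (15.0, 45.0, "passenger"),
    (-45.0, -15.0, "driver"),
    (-15.0, 15.0, "rear"),
]


def seat_for(direction_deg, mapping):
    """Convert a sound source direction to seat information, or None."""
    for low, high, seat in mapping:
        if low <= direction_deg <= high:
            return seat
    return None  # direction not covered by any seat range


print(seat_for(30.0, RIGHT_HAND_DRIVE))
print(seat_for(30.0, LEFT_HAND_DRIVE))
```

Keeping the ranges as data separate from `seat_for` reflects the point made above: moving the system to a different vehicle changes the table, while the processing that consumes the seat information stays untouched.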

As another example, suppose that the sound source information processing system 204 is used in a robot and that the mapping indicates which range of sound source directions is the direction the robot's face is pointing. A response can then be given reliably to sound from the direction the face is pointing even when the certainty of the sound source estimation is not high, while sound whose estimation certainty is not high is otherwise ignored unless that certainty reaches a predetermined threshold, which makes the robot robust against disturbances. In addition, simply rewriting the mapping information for the angles allows the notion of "the direction the robot's face is pointing" to be changed as needed, increasing the freedom with which the robot can be used.

Naturally, the angle definitions given here are only an example for explaining this embodiment and can be changed as appropriate, for instance according to the form of the automobile or by switching the positive and negative directions to suit left-hand or right-hand drive. It suffices that some meaning can be assigned according to the sound source direction and that this meaning can be associated with the sound source information.

In this embodiment, the certainty of the estimate is evaluated at the same time as the sound source direction is estimated. This evaluation is performed in the microphone array processing unit 104 together with the estimation of the sound source direction. The certainty can be set from features related to sound source information extraction, such as the residual after convergence of the adaptive filter used for sound source localization, the error with respect to the model used for estimation, a value correlated with volume, a measure of how voice-like the sound is, and the temporal change of the sound source direction or its variance, and can be realized with existing techniques. Since the method of calculating the certainty is not directly related to the present invention, a detailed description is omitted here.

The certainty in this embodiment is obtained from a combination of these features and takes one of three discrete values: high, medium, or low. It may also be a certainty obtained continuously from a function and then discretized with thresholds. Although the certainty is described here as a discrete value, the present invention can equally be applied when continuous values are used.
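Discretizing a continuous certainty with thresholds, as mentioned above, could be done along the following lines. The score range of 0 to 1 and the two threshold values are assumptions chosen for illustration; the specification does not give them.

```python
def discretize_certainty(score, low_threshold=0.4, high_threshold=0.7):
    """Map a continuous certainty score to "high", "medium", or "low".

    score          -- continuous certainty, assumed here to lie in [0, 1]
    low_threshold  -- hypothetical boundary between low and medium
    high_threshold -- hypothetical boundary between medium and high
    """
    if score >= high_threshold:
        return "high"
    if score >= low_threshold:
        return "medium"
    return "low"


for s in (0.9, 0.5, 0.1):
    print(s, discretize_certainty(s))
```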

The processing that applies predetermined processing to the extracted sound source information using the sound source direction obtained in this way and the certainty of its estimation is described with reference to FIG. 13. First, the case where the certainty of the direction estimate is high is described. When the certainty is high and the estimated sound source direction is in the range of −15 to +15 degrees, the mapping information described above identifies the seat of the sound source as the rear seat 220. It is judged that a user in the rear seat 220 spoke, and the sound source information estimated by the microphone array processing unit 104 is output together with the sound source direction.

When the certainty of the direction estimate is high and the estimated sound source direction is outside the range of −15 to +15 degrees, the mapping information described above identifies the seat of the sound source as the driver's seat 202a or the passenger seat 203a. It is judged that a user in the driver's seat or the passenger seat spoke, and the sound source information estimated by the microphone array processing unit 104 is output together with the sound source direction (angle). In particular, even when the estimated direction points away from the seats, for example +45 to +90 degrees, the user is judged to have moved, the utterance is judged to have come from a user in the driver's seat or the passenger seat, and the sound source information estimated by the microphone array processing unit 104 is output together with the sound source direction.

Next, the case where the certainty of the direction estimate is medium is described. When the certainty is medium and the estimated sound source direction is in the range of +15 to +45 degrees or −15 to −45 degrees, the mapping information described above identifies the seat of the sound source as the driver's seat 202a, the passenger seat 203a, or the rear seat 220. It is judged that a user in the driver's seat 202a, the passenger seat 203a, or the rear seat 220 spoke, and the sound source information estimated by the microphone array processing unit 104 is output together with the sound source direction.

When the certainty of the direction estimate is medium and the estimated sound source direction is outside the ranges of +15 to +45 degrees and −15 to −45 degrees, it cannot be determined reliably whether a user spoke. In this case, therefore, the sound source information processing system 204 does not output the estimated sound source direction as it is, but instead outputs information requesting confirmation of the sound source. For example, it outputs predetermined information stored in advance in the information holding means 107 (information to the effect of "cannot determine" or "reconfirm sound source"). With this processing, the in-vehicle information system 205 can, for example, ask the user to confirm the sound source by prompting "Please speak again." This prevents the in-vehicle information system 205 from malfunctioning because of uncertain information from the sound source information processing system 204.

Next, the case where the certainty of the direction estimate is low is described. When the certainty is low and the estimated sound source direction is in the range of +15 to +45 degrees or −15 to −45 degrees, the estimated sound source may be a user's utterance, but this cannot be determined reliably. In this case, therefore, the sound source information processing system 204 does not output the estimated sound source direction as it is, but instead outputs information requesting confirmation of the sound source. For example, it outputs predetermined information stored in advance in the information holding means 107 (information to the effect of "cannot determine" or "reconfirm sound source"). With this processing, the in-vehicle information system 205 can ask the user to confirm the sound source in the same way as above. This prevents the in-vehicle information system 205 from malfunctioning because of uncertain information from the sound source information processing system 204.

When the certainty factor of the sound source direction estimate is low and the estimated sound source direction is outside the ranges of +15 to +45 degrees and -15 to -45 degrees, the certainty factor is low and the direction itself is unlikely to correspond to an utterance by the user. In this case, the sound source information processing system 204 does not output the estimated sound source direction as it is; instead, it treats the situation as one with no utterance from the user (no useful sound source) and outputs information to that effect. For example, it outputs predetermined information stored in advance in the information holding means 107 (information to the effect of "no voice input" or "no output"). This processing prevents the in-vehicle information system 205 from malfunctioning because of uncertain information from the sound source information processing system 204.
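The three cases described above (medium certainty outside the range, low certainty inside the range, low certainty outside the range) can be sketched as a small lookup. This is only an illustrative sketch, not the implementation of the embodiment: the names `Confidence`, `in_seat_range`, `select_output`, and the two message constants are hypothetical, the inclusive handling of the 15 and 45 degree boundaries is an assumption, and the cases not covered by these paragraphs simply return `None` here.

```python
from enum import Enum
from typing import Optional


class Confidence(Enum):
    """Certainty level of the sound source direction estimate (hypothetical type)."""
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Stand-ins for the predetermined information held in the information
# holding means 107; the exact strings are assumptions for illustration.
REQUEST_CONFIRMATION = "reconfirm sound source"
NO_VOICE_INPUT = "no voice input"


def in_seat_range(direction_deg: float) -> bool:
    """True if the direction lies in +15..+45 or -15..-45 degrees.

    Treating the boundaries as inclusive is an assumption.
    """
    return 15.0 <= abs(direction_deg) <= 45.0


def select_output(confidence: Confidence, direction_deg: float) -> Optional[str]:
    """Choose the predetermined output for the three cases described above.

    Returns None for combinations that these paragraphs do not cover.
    """
    in_range = in_seat_range(direction_deg)
    if confidence is Confidence.MEDIUM and not in_range:
        # Cannot tell whether the user spoke: ask for confirmation.
        return REQUEST_CONFIRMATION
    if confidence is Confidence.LOW:
        # In range: possibly the user, so ask again; out of range: no utterance.
        return REQUEST_CONFIRMATION if in_range else NO_VOICE_INPUT
    return None
```

For example, `select_output(Confidence.LOW, -30.0)` yields the confirmation request, while `select_output(Confidence.LOW, 0.0)` yields the "no voice input" message. In neither case is the raw estimated direction passed on to the in-vehicle information system.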

In the sound source information processing system 204 according to the present embodiment described above, the estimated sound source angle and the certainty factor of the estimation for an utterance from the user are used as the sound source status information described earlier, so that predetermined processing can be applied to the extracted sound source information. An anthropomorphic agent that uses the sound source information processing system 204 can then perform more appropriate processing by acting on the output of that system. That is, when the in-vehicle information system 205 interacts with the user through an anthropomorphic agent, the agent can give the user who spoke a response that better fits the surrounding situation and is more reliable.

As described above, the sound source information processing apparatus according to the present invention is useful for collecting only a target sound in an environment with ambient noise, and is particularly suitable for applications such as car navigation systems and videophones.

A block diagram showing a configuration example of a sound source information processing system to which the present invention is applied.
A diagram illustrating an example in which the sound source information processing system to which the present invention is applied is used in an automobile.
A flowchart illustrating the flow of processing in the sound source information processing system.
A diagram showing a table of procedure information that associates judgment conditions with the processing applied to sound source information when noise intensity is used as the sound source status information.
A flowchart showing the judgment of whether predetermined processing needs to be applied to the extracted sound source information, and the flow of that processing.
A diagram showing the hardware configuration of the sound source information processing system according to Embodiment 1.
A characteristic diagram showing the history of estimated target sound source direction information stored in the information holding means.
A diagram showing a pressure sensor, which detects that a user is seated, arranged in the driver's seat.
A flowchart illustrating the flow of processing that uses a person-detecting sensor to judge whether predetermined processing needs to be applied to the sound source information.
A diagram showing a table of procedure information that associates judgment conditions with the processing applied to sound source information when the content of the sound source information of the target and non-target sound sources is used as the sound source status information.
A flowchart showing the judgment of whether predetermined processing needs to be applied to the extracted sound source information, and the flow of that processing.
A diagram showing the definition of sound source directions inside the vehicle.
A diagram showing a table of procedure information that associates judgment conditions with the processing applied to sound source information when the sound source direction and the certainty factor of the sound source information estimate are used as the sound source status information.

Explanation of symbols

101-1 to 101-N: Microphones
102: Input processing unit
103: Signal processing unit
104: Microphone array processing unit
105: Sound source information processing unit
106: Output
107: Information holding means
201: Automobile
202: User in the driver's seat
202a: Driver's seat
203: User in the passenger seat
203a: Passenger seat
204: Sound source information processing system
205: In-vehicle information system
206: Automobile operation system

Claims

A plurality of sound collecting means for collecting sound signals of a plurality of sound sources;
Sound source estimation means for estimating the direction of at least one sound source based on the sound signals collected by the plurality of sound collection means;
Sound source information extracting means for extracting sound source information related to a target sound source among the sound sources in the estimated direction;
Sound source status information acquisition means for acquiring sound source status information relating to at least one of one or a plurality of sound sources and surrounding conditions for selecting the processing content to be applied to the extracted sound source information;
Processing means for performing predetermined processing on the extracted sound source information based on the sound source status information;
A sound source information processing apparatus comprising:

The sound source information processing apparatus according to claim 1, further comprising holding means for holding the processing content to be applied to the extracted sound source information based on the sound source status information,
wherein the processing means calls a predetermined processing procedure from the holding means and executes the processing.

The sound source information processing apparatus according to claim 1, wherein the sound source status information acquisition means holds a history of sound source information of the sound source estimated by the sound source estimation means,
and the processing means uses the history of the sound source information as the sound source status information.

The sound source information processing apparatus according to claim 1, wherein the processing means applies the predetermined processing to the extracted sound source information at regular time intervals or at an arbitrary timing.

The sound source information processing apparatus according to claim 1, wherein the sound source status information acquisition means acquires the sound source status information from information input means other than the sound collecting means.

The sound source information processing apparatus according to claim 1, wherein the sound source status information acquisition means estimates the content of the sound source from the sound signal of the sound source estimated by the sound source estimation means,
and the processing means uses the content of the sound source as the sound source status information.

The sound source information processing apparatus according to claim 1, wherein the sound source status information acquisition means estimates the content of sound sources lying in directions other than the estimated sound source direction, from a sound signal collected with the directivity toward the sound source direction estimated by the sound source estimation means removed,
and the processing means uses the content of the sound sources in directions other than the estimated sound source direction as the sound source status information.

The sound source information processing apparatus according to claim 1, wherein the sound source status information acquisition means takes the sound source direction information estimated by the sound source estimation means as input and converts it into predetermined sound source status information,
and the processing means uses the converted sound source status information.

The sound source information processing apparatus according to claim 1, wherein the sound source estimation means estimates a sound source and also estimates a certainty factor of that estimation,
the sound source status information acquisition means acquires the certainty factor of the estimation as the sound source status information,
and the processing means uses the certainty factor of the estimation as the sound source status information.

The sound source information processing apparatus according to claim 1, wherein the processing means outputs predetermined information stored in advance.

A sound collecting step of collecting sound signals of a plurality of sound sources;
A sound source estimating step of estimating a direction of at least one sound source based on the sound signal collected in the sound collecting step;
A sound source information extracting step for extracting sound source information related to an output target from among the sound sources in the estimated direction;
A sound source status information acquisition step for acquiring sound source status information related to at least one of one or a plurality of sound sources and surrounding conditions for selecting the processing content to be performed on the extracted sound source information;
A processing step of performing a predetermined process on the extracted sound source information based on the sound source status information;
A sound source information processing method comprising:

A sound collecting step of collecting sound signals of a plurality of sound sources;
A sound source estimating step of estimating a direction of at least one sound source based on the sound signal collected in the sound collecting step;
A sound source information extracting step for extracting sound source information related to an output target from among the sound sources in the estimated direction;
A sound source status information acquisition step for acquiring sound source status information related to at least one of one or a plurality of sound sources and surrounding conditions for selecting the processing content to be performed on the extracted sound source information;
A processing step of performing a predetermined process on the extracted sound source information based on the sound source status information;
A sound source information processing program for causing a computer to execute the above steps.