JP2017211637A - Acoustic signal processing device, acoustic signal processing method, and program

Info

Publication number: JP2017211637A
Application number: JP2017039697A
Authority: JP
Inventors: Minoru Yoshida; Akihito Aiba
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-05-20
Filing date: 2017-03-02
Publication date: 2017-11-30

Abstract

PROBLEM TO BE SOLVED: To suppress the transmission of noise.
SOLUTION: An acoustic signal processing device according to one embodiment includes: a beamformer unit that generates, for each of a plurality of target directions, a target signal, i.e., an acoustic signal corresponding to sound arriving from that target direction, on the basis of multi-channel acoustic signals output by a plurality of microphones; a feature quantity calculation unit that calculates a feature quantity for each target direction on the basis of the target signal of that direction; and a direction selection unit that selects an output direction from among the target directions on the basis of the feature quantity of each target direction.
SELECTED DRAWING: Figure 1

Description

The present invention relates to an acoustic signal processing device, an acoustic signal processing method, and a program.

In a voice conference system or a video conference system (hereinafter, "voice conference system or the like"), a voice call is carried out by transmitting voice collected by microphones to the communication partner. For a comfortable voice call, it is important to pick up the spoken voice accurately. One known method of picking up spoken voice is to place directional microphones near the conference participants and turn on only the microphone placed near the current speaker. Another known method is to place microphones at a plurality of locations and mix the outputs of all of them.

However, the conventional methods described above have a problem: the microphones also pick up noise (for example, the sound of turning pages or of moving a chair), and the collected noise is transmitted to the communication partner.

The present invention has been made in view of the above problem, and an object thereof is to suppress the transmission of noise.

An acoustic signal processing device according to one embodiment includes: a beamformer unit that generates, for each of a plurality of target directions, a target signal, which is an acoustic signal corresponding to sound arriving from that target direction, based on multi-channel acoustic signals output by a plurality of microphones; a feature quantity calculation unit that calculates a feature quantity for each target direction based on the target signal of that target direction; and a direction selection unit that selects an output direction from among the plurality of target directions based on the feature quantity of each target direction.

According to each embodiment of the present invention, the transmission of noise can be suppressed.

FIG. 1 is a diagram showing an example of the functional configuration of the acoustic signal processing device according to the first embodiment.
FIG. 2 is a diagram illustrating the distances from a sound source to two microphones.
FIG. 3 is a diagram showing the acoustic signals from the two microphones of FIG. 2.
FIG. 4 is a diagram showing an example of the hardware configuration of the acoustic signal processing device according to the first embodiment.
FIG. 5 is a flowchart showing an example of the operation of the acoustic signal processing device according to the first embodiment.
FIG. 6 is a diagram illustrating a specific example of the operation of the acoustic signal processing device according to the first embodiment.
FIG. 7 is a diagram showing an example of the functional configuration of the acoustic signal processing device according to the second embodiment.
FIG. 8 is a flowchart showing an example of the operation of the acoustic signal processing device according to the second embodiment.
FIG. 9 is a diagram illustrating a specific example of the feature quantity calculation method in the third embodiment.
FIG. 10 is a flowchart showing an example of the first selection method in the fourth embodiment.
FIG. 11 is a flowchart showing an example of the second selection method in the fourth embodiment.
FIG. 12 is a diagram showing a specific example of the second selection method in the fourth embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the specification and drawings of each embodiment, components having substantially the same functional configuration are denoted by the same reference numerals, and redundant description is omitted.

<First Embodiment>
The acoustic signal processing device 1 according to the first embodiment will be described with reference to FIGS. 1 to 6. The acoustic signal processing device 1 according to the present embodiment collects sound with a plurality of non-directional (omnidirectional) microphones, executes predetermined processing on the multi-channel acoustic signals output by those microphones, determines whether a sound source is present, and selects an output direction. The acoustic signal processing device 1 is applicable to a voice conference system or the like.

(Configuration of the acoustic signal processing device 1)
FIG. 1 is a diagram showing an example of the functional configuration of the acoustic signal processing device 1 according to the present embodiment. The acoustic signal processing device 1 of FIG. 1 includes a sound collection unit 11, an acoustic signal storage unit 12, a beamformer unit 13, a candidate signal storage unit 14, a feature quantity calculation unit 15, a feature quantity storage unit 16, a direction selection unit 17, and an output unit 18.

The sound collection unit 11 collects external sound and outputs an acoustic signal (electric signal) corresponding to the collected sound. The sound collection unit 11 has no directivity and collects sound from all directions. As described later, the sound collection unit 11 is realized by a plurality of omnidirectional microphones. The sound collection unit 11 therefore outputs multi-channel acoustic signals.

The acoustic signal storage unit 12 stores the multi-channel acoustic signals output by the sound collection unit 11, channel by channel. The acoustic signal storage unit 12 is configured as, for example, a FIFO (First In First Out) ring buffer.
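As a minimal sketch of such per-channel FIFO storage, the following Python example keeps only the most recent samples of each channel. The class name `ChannelRingBuffer`, its methods, and the capacity are hypothetical and chosen for illustration; the source does not specify how the ring buffer is implemented.

```python
from collections import deque


class ChannelRingBuffer:
    """Keeps the most recent `capacity` samples of every channel (FIFO)."""

    def __init__(self, num_channels: int, capacity: int) -> None:
        # A deque with maxlen discards its oldest element when a new one
        # is appended to a full buffer, which gives ring-buffer behaviour.
        self._channels = [deque(maxlen=capacity) for _ in range(num_channels)]

    def write(self, block: list[list[float]]) -> None:
        """Append one block of samples; block[ch] holds the samples of channel ch."""
        if len(block) != len(self._channels):
            raise ValueError("block must contain one sample list per channel")
        for channel, samples in zip(self._channels, block):
            channel.extend(samples)

    def read(self, channel: int) -> list[float]:
        """Return the stored samples of one channel, oldest first."""
        return list(self._channels[channel])


storage = ChannelRingBuffer(num_channels=2, capacity=4)
storage.write([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
storage.write([[4.0, 5.0], [40.0, 50.0]])
print(storage.read(0))  # [2.0, 3.0, 4.0, 5.0]
print(storage.read(1))  # [20.0, 30.0, 40.0, 50.0]
```

With a capacity of four samples, the oldest sample of each channel is dropped once the fifth sample is written.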

The beamformer unit 13 executes beamforming processing on the multi-channel acoustic signals output by the sound collection unit 11. The beamforming processing includes processing that generates a target signal S for a target direction D based on the multi-channel acoustic signals, and processing that generates a noise signal N by forming a blind spot (null point) in the target direction D.

Here, an outline of the beamforming processing will be described with reference to FIGS. 2 and 3. FIG. 2 is a diagram illustrating the distances from a sound source SS to two microphones M1 and M2. FIG. 3 is a diagram showing the acoustic signals output by the microphones M1 and M2 of FIG. 2.

As shown in FIG. 2, when a sound source SS exists in a certain target direction D, the sound from the sound source SS is collected by the microphones M1 and M2. Because the distance from the sound source SS to the microphone M1 differs from the distance from the sound source SS to the microphone M2, the sound from the sound source SS reaches the microphones M1 and M2 at different times.

For example, in FIG. 2, the distance from the sound source SS to the microphone M2 is longer than the distance from the sound source SS to the microphone M1 by d sin θ. Here, d is the installation interval between the microphones M1 and M2, and θ is the angle between a line orthogonal to the straight line passing through the microphones M1 and M2 and the straight line connecting the sound source SS to the microphones M1 and M2.

Therefore, the sound from the sound source SS reaches the microphone M2 later than the microphone M1 by d sin θ / c, where c is the speed of sound. As a result, as shown in FIG. 3, the microphone M2 outputs the acoustic signal corresponding to the sound from the sound source SS with a delay of d sin θ / c relative to the microphone M1.
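The delay d sin θ / c can be evaluated numerically as in the following sketch. The microphone spacing (5 cm), angle (30 degrees), sampling rate (48 kHz), and the speed of sound (343 m/s) are illustrative assumptions that do not come from the source, and the function names are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at about 20 degrees C


def arrival_delay_seconds(d: float, theta: float, c: float = SPEED_OF_SOUND) -> float:
    """Delay of M2 relative to M1, d * sin(theta) / c, with theta in radians."""
    return d * math.sin(theta) / c


def arrival_delay_samples(d: float, theta: float, sample_rate: float,
                          c: float = SPEED_OF_SOUND) -> float:
    """The same delay expressed in (possibly fractional) samples."""
    return arrival_delay_seconds(d, theta, c) * sample_rate


tau = arrival_delay_seconds(d=0.05, theta=math.radians(30.0))
n = arrival_delay_samples(0.05, math.radians(30.0), 48000.0)
print(f"{tau * 1e6:.1f} microseconds")  # 72.9 microseconds
print(f"{n:.2f} samples")               # 3.50 samples
```

Under these assumed values the delay is a fraction of a sample period, so a practical implementation would typically need fractional-delay handling or a sufficiently high sampling rate.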

Accordingly, delaying the acoustic signal output by the microphone M1 by d sin θ / c and adding it to the acoustic signal output by the microphone M2 yields a signal in which the acoustic signal corresponding to the sound from the sound source SS is emphasized (target signal S). Likewise, delaying the acoustic signal output by the microphone M1 by d sin θ / c and subtracting it from the acoustic signal output by the microphone M2 yields a signal in which the acoustic signal corresponding to the sound from the sound source SS is reduced (noise signal N).

In this way, the beamformer unit 13 generates the target signal S and the noise signal N for the target direction D by delaying the acoustic signals from the plurality of microphones by respective predetermined delay times and then adding and subtracting them.
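The two-microphone case described above can be sketched as follows. This is an illustration under simplifying assumptions, not the implementation of the embodiment: the delay is rounded to a whole number of samples, the signals are plain Python lists, and the helper names `delay` and `beamform_pair` are hypothetical.

```python
def delay(signal: list[float], k: int) -> list[float]:
    """Delay a signal by k >= 0 samples, zero-padding the start and keeping the length."""
    if k < 0:
        raise ValueError("k must be non-negative")
    padded = [0.0] * k + list(signal)
    return padded[:len(signal)]


def beamform_pair(m1: list[float], m2: list[float],
                  delay_samples: int) -> tuple[list[float], list[float]]:
    """Return (target S, noise N) for the direction whose M1-to-M2 delay is delay_samples."""
    m1_delayed = delay(m1, delay_samples)
    target = [b + a for a, b in zip(m1_delayed, m2)]  # in-phase sum emphasizes the direction
    noise = [b - a for a, b in zip(m1_delayed, m2)]   # difference places a null on the direction
    return target, noise


source = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
m1 = source
m2 = delay(source, 2)  # the same sound reaches M2 two samples later
target, noise = beamform_pair(m1, m2, delay_samples=2)
print(target)  # [0.0, 0.0, 0.0, 2.0, 0.0, -2.0, 0.0, 2.0]
print(noise)   # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

In this idealized example the sound from the steered direction is doubled in the target signal and cancelled completely in the noise signal.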

The example of FIG. 2 assumes that the sound source SS is sufficiently far from the microphones M1 and M2, so the straight line passing through the sound source SS and the microphone M1 and the straight line passing through the sound source SS and the microphone M2 are parallel. Although FIG. 2 shows the case where the sound collection unit 11 is realized by two microphones M1 and M2, the sound collection unit 11 may be realized by three or more microphones.

A plurality of target directions D are set in advance. In the following, n target directions D, from the first to the n-th, are assumed to be set (n = 8 in the case of FIG. 6). The i-th target direction D is denoted target direction Di (1 ≤ i ≤ n). The target directions D1 to Dn all differ from one another. The target signal S and the noise signal N for the target direction Di are denoted target signal Si and noise signal Ni, respectively.

The target signal Si is an acoustic signal corresponding to sound arriving from the target direction Di. The processing that generates the target signal Si corresponds to extracting, from the acoustic signals output by the sound collection unit 11 that correspond to sound arriving from all directions, the acoustic signal corresponding to sound arriving from the target direction Di.

The beamformer unit 13 generates the target signal Si by delaying the multi-channel acoustic signals by respective predetermined delay times and adding them. That is, for each target direction Di, the beamformer unit 13 delays and adds the acoustic signals so that, when sound arrives from the target direction Di, the acoustic signal corresponding to that sound is emphasized. The target signal Si generated in this way corresponds to the adder output of a delay-and-sum beamformer whose beam point (high-sensitivity direction) is formed in the target direction Di.

The delay time of each acoustic signal is set in advance for each target direction Di so that the phases of sound arriving from the target direction Di coincide. As described above, the delay time can be set according to, for example, the installation interval d of the omnidirectional microphones constituting the sound collection unit 11, the target direction Di, and the installation position of the omnidirectional microphone that outputs the acoustic signal.
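One way such a table of preset delay times could be built is sketched below. The sketch assumes a uniform linear array with far-field (plane-wave) arrival, which generalizes the two-microphone geometry of FIG. 2; the array geometry, the candidate angles, the speed of sound, and the function names are assumptions for illustration and are not taken from the source.

```python
import math


def steering_delays(num_mics: int, spacing: float, theta: float,
                    c: float = 343.0) -> list[float]:
    """Per-microphone delays (seconds) that align sound arriving from angle theta (radians).

    Microphone m sits at position m * spacing, so the sound reaches it
    m * spacing * sin(theta) / c later than microphone 0. Each channel is
    delayed until it lines up with the channel that receives the sound last,
    which keeps every delay non-negative.
    """
    arrival = [m * spacing * math.sin(theta) / c for m in range(num_mics)]
    latest = max(arrival)
    return [latest - t for t in arrival]


def delay_table(num_mics: int, spacing: float, angles_deg: list[float],
                c: float = 343.0) -> dict[float, list[float]]:
    """Precompute the delays for every candidate target direction."""
    return {a: steering_delays(num_mics, spacing, math.radians(a), c)
            for a in angles_deg}


table = delay_table(num_mics=3, spacing=0.05,
                    angles_deg=[-60.0, -30.0, 0.0, 30.0, 60.0])
for angle, delays in table.items():
    print(angle, [round(t * 1e6, 1) for t in delays])  # delays in microseconds
```

Because the delays depend only on the geometry and the target direction, they can be computed once and reused for every frame.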

When the beamformer unit 13 delays the multi-channel acoustic signals by the predetermined delay times and adds them as described above, the acoustic signal of the target direction Di is emphasized. In contrast, acoustic signals from directions other than the target direction Di are not emphasized as much as the acoustic signal of the target direction Di, even when they are delayed by the same delay times and added. The result is a signal in which only the acoustic signal of the target direction Di is emphasized. The beamformer unit 13 outputs the signal obtained in this way as the target signal Si.

The method of generating the target signal Si is not limited to the above. For example, before adding the acoustic signals of the channels, the beamformer unit 13 may amplify the acoustic signal of each channel to adjust its signal level, or may filter the acoustic signal of each channel to remove unnecessary frequency components.

The noise signal Ni is an acoustic signal corresponding to sound arriving from a direction different from the target direction Di, that is, from a direction other than the target direction Di. The noise signal Ni corresponds to the noise component that may be contained in the target signal Si.

In the present embodiment, the processing that generates the noise signal Ni corresponds to removing the acoustic signal corresponding to sound arriving from the target direction Di from the acoustic signals, output by the sound collection unit 11, that correspond to sound arriving from all directions. In other words, it corresponds to acquiring the acoustic signals of directions other than the target direction Di by forming a null point (low-sensitivity direction) in the target direction Di.

Specifically, in the present embodiment, the beamformer unit 13 generates the noise signal Ni by delaying the multi-channel acoustic signals by respective predetermined delay times and subtracting them, thereby removing the acoustic signal corresponding to sound arriving from the target direction Di. The noise signal Ni generated in this way corresponds to the subtractor output of a delay-and-sum beamformer whose null point is formed in the target direction Di. The delay times of the acoustic signals are set in the same way as for the target signal Si.

When the beamformer unit 13 subtracts the multi-channel acoustic signals as described above, the acoustic signal of the target direction Di is reduced. In contrast, acoustic signals from directions other than the target direction Di are not reduced as much as the acoustic signal of the target direction Di. The result is a signal in which the acoustic signal of the target direction Di is reduced. The beamformer unit 13 outputs the signal obtained in this way as the noise signal Ni.
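The directional behaviour of the sum and difference outputs can be checked with a small calculation. The sketch below assumes two microphones, an ideal far-field plane wave at a single frequency, and a speed of sound of 343 m/s; these conditions and the function names are illustrative assumptions rather than part of the embodiment. For a phase difference φ = 2πf · d (sin θ − sin θ0) / c between the aligned channels, the normalized sum output has magnitude |cos(φ/2)| and the normalized difference output has magnitude |sin(φ/2)|.

```python
import cmath
import math


def _phase_difference(freq: float, spacing: float, steer: float,
                      arrival: float, c: float) -> float:
    """Residual phase between the two channels after the steering delay is applied."""
    tau_steer = spacing * math.sin(steer) / c
    tau_arrival = spacing * math.sin(arrival) / c
    return 2.0 * math.pi * freq * (tau_arrival - tau_steer)


def sum_gain(freq: float, spacing: float, steer: float, arrival: float,
             c: float = 343.0) -> float:
    """Normalized gain of the delay-and-add (target signal) output; angles in radians."""
    phi = _phase_difference(freq, spacing, steer, arrival, c)
    return abs(1.0 + cmath.exp(-1j * phi)) / 2.0


def difference_gain(freq: float, spacing: float, steer: float, arrival: float,
                    c: float = 343.0) -> float:
    """Normalized gain of the delay-and-subtract (noise signal) output; angles in radians."""
    phi = _phase_difference(freq, spacing, steer, arrival, c)
    return abs(1.0 - cmath.exp(-1j * phi)) / 2.0


steer = math.radians(0.0)
for arrival_deg in (0.0, 30.0, 60.0):
    arrival = math.radians(arrival_deg)
    print(arrival_deg,
          round(sum_gain(2000.0, 0.05, steer, arrival), 3),
          round(difference_gain(2000.0, 0.05, steer, arrival), 3))
```

Sound from the steered direction passes the sum output at full gain and is cancelled in the difference output, while sound from other directions is attenuated only partially in the sum output and leaks into the difference output.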

The method of generating the noise signal Ni is not limited to the above. For example, before subtracting the acoustic signals of the channels, the beamformer unit 13 may amplify the acoustic signal of each channel to adjust its signal level, or may filter the acoustic signal of each channel to remove unnecessary frequency components.

The beamformer unit 13 executes the beamforming processing for each target direction Di and generates the target signal Si and the noise signal Ni for that direction. The beamformer unit 13 may execute the beamforming processing for the target directions Di sequentially or in parallel.

The candidate signal storage unit 14 stores a candidate signal for each target direction Di. A candidate signal is a signal that is a candidate for the output signal output by the output unit 18. In the present embodiment, the candidate signal storage unit 14 stores, as the candidate signal for each target direction Di, the target signal Si generated by the beamformer unit 13 for that direction.

The feature quantity calculation unit 15 calculates a feature quantity C for each target direction Di based on the target signal Si and the noise signal Ni of that direction. The feature quantity C for the target direction Di is denoted feature quantity Ci. The feature quantity Ci may be any acoustic feature quantity that indicates the state of the sound in the target direction Di and can be calculated from the target signal Si and the noise signal Ni. It may be a time-domain or a frequency-domain acoustic feature quantity. When a frequency-domain feature quantity Ci is calculated, the feature quantity calculation unit 15 is preferably capable of executing a fast Fourier transform (FFT), which allows the feature quantity Ci to be calculated in a short time.

The feature quantity storage unit 16 stores the feature quantity Ci of each target direction Di calculated by the feature quantity calculation unit 15.

The direction selection unit 17 determines, based on the feature quantity Ci of each target direction Di, whether a sound source is present in that direction. The direction selection unit 17 also selects an output direction Dout based on the result of that determination and the feature quantities Ci. The output direction Dout is the target direction Di of the candidate signal that the output unit 18 outputs as the output signal, and is one of the n target directions Di.

When the direction selection unit 17 determines that a sound source is present, it selects as the output direction Dout, from among the target directions Di determined to contain a sound source, the target direction Di in which the sound arriving from the sound source is loudest. When it determines that no sound source is present, it selects as the output direction Dout the target direction Di with the quietest arriving sound (in the present embodiment, the target direction Di with the smallest noise signal Ni).
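The selection rule can be sketched as follows. The passage above does not state how the presence of a sound source is decided, so the sketch assumes a simple threshold on the feature quantity (here an SNR-like value per direction); the threshold, the per-direction noise levels, and the function name `select_output_direction` are hypothetical.

```python
def select_output_direction(features: list[float], noise_levels: list[float],
                            threshold: float) -> int:
    """Return the index of the output direction Dout.

    features[i]     : feature quantity Ci of direction Di (larger means louder source)
    noise_levels[i] : level of the noise signal Ni of direction Di
    threshold       : assumed decision level for "a sound source is present"
    """
    if not features or len(features) != len(noise_levels):
        raise ValueError("features and noise_levels must be non-empty and equal in length")
    # Directions judged to contain a sound source (assumed rule: Ci >= threshold).
    active = [i for i, value in enumerate(features) if value >= threshold]
    if active:
        # A source is present: pick the active direction with the loudest source.
        return max(active, key=lambda i: features[i])
    # No source anywhere: pick the direction with the smallest noise signal.
    return min(range(len(noise_levels)), key=lambda i: noise_levels[i])


speech = select_output_direction([0.5, 3.0, 6.0, 1.0], [0.4, 0.2, 0.3, 0.1], threshold=2.0)
silence = select_output_direction([0.5, 0.8, 0.6, 0.7], [0.4, 0.2, 0.3, 0.1], threshold=2.0)
print(speech)   # 2
print(silence)  # 3
```

Falling back to the quietest direction when no source is detected means that whatever is still transmitted contains as little noise as possible, which matches the stated object of suppressing the transmission of noise.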

The output unit 18 outputs the candidate signal of the output direction Dout as the output signal. Since the candidate signal is the target signal Si in the present embodiment, the output unit 18 outputs the target signal Sout of the output direction Dout as the output signal. Outputting the output signal includes outputting it to a device external to the acoustic signal processing device 1 and outputting it from the sound output device (speaker) of the acoustic signal processing device 1.

FIG. 4 is a diagram showing an example of the hardware configuration of the acoustic signal processing device 1 according to the present embodiment. The acoustic signal processing device 1 of FIG. 4 includes a microphone array 100 and a computer 200.

The microphone array 100 is an array of a plurality of omnidirectional microphones installed at a predetermined installation interval d, and is connected to the computer 200. Each omnidirectional microphone of the microphone array 100 outputs an acoustic signal corresponding to the sound it collects. When the microphone array 100 includes M omnidirectional microphones, it outputs M channels of acoustic signals. The sound collection unit 11 is realized by the microphone array 100.

The computer 200 includes a processor 201, a memory 202, a microphone interface 203, an input device 204, a display device 205, a communication device 206, a sound output device 207, and a bus 208.

The processor 201 executes a program stored in the memory 202 and thereby realizes the beamformer unit 13, the feature quantity calculation unit 15, the direction selection unit 17, and the output unit 18 of the acoustic signal processing device 1. The processor 201 is, for example, a CPU (Central Processing Unit), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a PLD (Programmable Logic Device).

The memory 202 stores the program executed by the processor 201 and various data. The memory 202 realizes the acoustic signal storage unit 12, the candidate signal storage unit 14, and the feature quantity storage unit 16. The memory 202 is, for example, a RAM (Random Access Memory), a DRAM (Dynamic RAM), an SRAM (Static RAM), a hard disk, an optical disk, or a flash memory.

The microphone interface 203 mediates communication between the microphone array 100 and the computer 200. The microphone interface 203 is an analog front end (AFE) that includes an AD (Analog to Digital) converter. It converts the analog signals (acoustic signals) output by the microphone array 100 into digital signals and stores them in the memory 202. It also inputs control signals from the processor 201 to the microphone array 100. The microphone interface 203 may be realized by a program (software).

The input device 204 is, for example, a keyboard, a mouse, a push button, or a touch panel. The user can operate the acoustic signal processing device 1 via the input device 204.

The display device 205 is, for example, a liquid crystal display, a plasma display, a cathode ray tube display, or a lamp. The display device 205 may display the output direction Dout and other information.

The communication device 206 is, for example, a modem, a hub, or a router. The output signal of the acoustic signal processing device 1 is transmitted to an external device via the communication device 206.

The sound output device 207 is, for example, a speaker or a buzzer. The sound output device 207 may output the output signal of the acoustic signal processing device 1. When the acoustic signal processing device 1 is applied to a voice conference system or the like, the sound output device 207 outputs the acoustic signal received from the communication partner.

The bus 208 interconnects the processor 201, the memory 202, the microphone interface 203, the input device 204, the display device 205, the communication device 206, and the sound output device 207.

(Operation of the acoustic signal processing device 1)
FIG. 5 is a flowchart showing an example of the operation of the acoustic signal processing device 1 according to the present embodiment. The acoustic signal processing device 1 executes the operation of FIG. 5 at predetermined time intervals. In the following, the acoustic signal processing device 1 is assumed to operate in units of frames having a predetermined time width. The frame number of the current frame is denoted p. The sound collection unit 11 is assumed to output acoustic signals continuously while the acoustic signal processing device 1 is operating.

When the start time of the current frame (p) arrives, the acoustic signal storage unit 12 first stores, channel by channel, the acoustic signals being output by the sound collection unit 11 (step ST101). In the example of FIG. 5, it suffices that the acoustic signal storage unit 12 can store at least one frame of acoustic signals.

Next, the beamformer unit 13 reads the acoustic signal of each channel from the acoustic signal storage unit 12 and executes the beamforming processing for each target direction Di based on the signals it has read. This generates the target signal Si and the noise signal Ni of each target direction Di for the current frame (p) (step ST102). The beamformer unit 13 passes the generated target signal Si and noise signal Ni of each target direction Di to the feature quantity calculation unit 15. The beamforming processing is as described above.

The beamformer unit 13 also stores the generated target signals Si in the candidate signal storage unit 14 as candidate signals. The candidate signal storage unit 14 stores the candidate signals (target signals Si) for each target direction Di (step ST103).

Subsequently, the feature quantity calculation unit 15 calculates the feature quantity Ci of each target direction Di based on the target signal Si and the noise signal Ni of that direction received from the beamformer unit 13 (step ST104). The feature quantity Ci of the current frame (p) is denoted feature quantity Ci(p).

以下では、特徴量Ｃが信号対雑音比（ＳＮＲ：Signal to Noise Ratio）である場合を例に説明する。ＳＮＲは、雑音成分に対する信号成分の割合である。ＳＮＲは、音源から到来する音の大きさに相当する。以下、現フレーム（ｐ）の特徴量Ｃｉ（ｐ）を、ＳＮＲｉ（ｐ）とする。ＳＮＲｉ（ｐ）は、以下の式で計算される。 Hereinafter, a case where the characteristic amount C is a signal to noise ratio (SNR) will be described as an example. SNR is the ratio of the signal component to the noise component. The SNR corresponds to the volume of sound coming from the sound source. Hereinafter, the feature amount Ci (p) of the current frame (p) is assumed to be SNRi (p). SNRi (p) is calculated by the following equation.

式（１）において、ｆは周波数、Ｓｉ（ｆ）は目標信号Ｓｉに含まれる周波数ｆを有する成分の信号レベル、Ｎｉ（ｆ）は雑音信号Ｎｉに含まれる周波数ｆを有する成分の信号レベル、ｆｍｉｎは下限周波数、ｆｍａｘは上限周波数である。式（１）により計算されるＳＮＲｉ（ｐ）は、下限周波数ｆｍｉｎから上限周波数ｆｍａｘまでの帯域に帯域制限されたＳＮＲである。 In Equation (1), f is the frequency, Si (f) is the signal level of the component having the frequency f included in the target signal Si, Ni (f) is the signal level of the component having the frequency f included in the noise signal Ni, fmin is a lower limit frequency, and fmax is an upper limit frequency. SNRi (p) calculated by Expression (1) is an SNR that is band-limited to a band from the lower limit frequency fmin to the upper limit frequency fmax.
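
The band-limited SNR described above can be sketched as follows. The exact form of Equation (1) is not reproduced in this text, so this sketch assumes one common form: the ratio of summed signal power to summed noise power over the band, expressed in dB. The function name, the `eps` guard, and the per-bin list layout are hypothetical.

```python
import math


def band_limited_snr_db(target_levels, noise_levels, freqs, fmin, fmax, eps=1e-12):
    """Band-limited SNR (dB) of one target direction for one frame.

    target_levels[k] and noise_levels[k] are the signal levels Si(f) and
    Ni(f) at frequency freqs[k]. The power-ratio-in-dB form is an
    assumption, not the patent's stated formula.
    """
    signal_power = 0.0
    noise_power = 0.0
    for f, s, n in zip(freqs, target_levels, noise_levels):
        if fmin <= f <= fmax:  # only bins inside [fmin, fmax] contribute
            signal_power += s * s
            noise_power += n * n
    # eps avoids division by zero and log10(0) in silent frames
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))
```

Narrowing the band reduces the number of bins visited by the loop, which is the computational saving the description attributes to raising fmin or lowering fmax.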

下限周波数ｆｍｉｎは、任意の周波数に設定できる。下限周波数ｆｍｉｎは、例えば、２０Ｈｚ（可聴領域の下限周波数）に設定してもよい。下限周波数ｆｍｉｎが高いほど、ＳＮＲｉ（ｐ）の計算量が減るため、ＳＮＲｉ（ｐ）の計算処理を高速化できる。 The lower limit frequency fmin can be set to an arbitrary frequency. For example, the lower limit frequency fmin may be set to 20 Hz (the lower limit frequency of the audible region). Since the calculation amount of SNRi (p) decreases as the lower limit frequency fmin increases, the calculation process of SNRi (p) can be speeded up.

上限周波数ｆｍａｘは、任意の周波数に設定できる。上限周波数ｆｍａｘは、例えば、２０ｋＨｚ（可聴領域の上限周波数）に設定してもよい。上限周波数ｆｍａｘが低いほど、ＳＮＲｉ（ｐ）の計算量が減るため、ＳＮＲｉ（ｐ）の計算処理を高速化できる。 The upper limit frequency fmax can be set to an arbitrary frequency. For example, the upper limit frequency fmax may be set to 20 kHz (upper limit frequency of the audible region). As the upper limit frequency fmax is lower, the calculation amount of SNRi (p) is reduced, so that the calculation process of SNRi (p) can be speeded up.

上限周波数ｆｍａｘは、目標信号Ｓｉ及び雑音信号Ｎｉに空間エイリアシングが発生しないように設定されるのが好ましい。ここでいう空間エイリアシングとは、無指向性マイクロホンの設置間隔ｄに起因した信号の折り返し歪みのことである。 The upper limit frequency fmax is preferably set so that no spatial aliasing occurs in the target signal Si and the noise signal Ni. Spatial aliasing here refers to signal aliasing caused by the installation interval d of omnidirectional microphones.

具体的には、上限周波数ｆｍａｘは、空間エイリアシングが発生する下限周波数ｆｎｙｑ（第１の周波数）未満に設定されるのが好ましい（ｆｍａｘ＜ｆｎｙｑ）。空間エイリアシングが発生する下限周波数ｆｎｙｑは、以下の式で計算される。 Specifically, the upper limit frequency fmax is preferably set to be lower than the lower limit frequency fnyq (first frequency) at which spatial aliasing occurs (fmax <fnyq). The lower limit frequency fnyq at which spatial aliasing occurs is calculated by the following equation.
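
Equation (2) is not reproduced in this text. Assuming the usual half-wavelength condition for a microphone pair, it can be reconstructed as:

$$f_{nyq} = \frac{c}{2d} \tag{2}$$

As an illustration with assumed values, d = 0.02 m and c = 340 m/s give fnyq = 8.5 kHz.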

式（２）において、ｄは無指向性マイクロホンの設置間隔、ｃは音速である。目標信号Ｓｉ及び雑音信号Ｎｉの、下限周波数ｆｎｙｑ以上の周波数ｆを有する周波数成分には、空間エイリアシングが発生する。これに対して、目標信号Ｓｉ及び雑音信号Ｎｉの、下限周波数ｆｎｙｑ未満の周波数ｆを有する周波数成分には、空間エイリアシングが発生しない。 In Expression (2), d is the installation interval of the omnidirectional microphone, and c is the speed of sound. Spatial aliasing occurs in the frequency component of the target signal Si and the noise signal Ni having a frequency f equal to or higher than the lower limit frequency fnyq. On the other hand, spatial aliasing does not occur in the frequency component having the frequency f less than the lower limit frequency fnyq of the target signal Si and the noise signal Ni.

したがって、上限周波数ｆｍａｘを下限周波数ｆｎｙｑ未満に設定することにより、目標信号Ｓｉ及び雑音信号Ｎｉの、空間エイリアシングが発生していない周波数成分を利用して、ＳＮＲｉ（ｐ）を計算できる。この結果、ＳＮＲｉ（ｐ）を精度よく計算することができる。 Therefore, by setting the upper limit frequency fmax to be less than the lower limit frequency fnyq, the SNRi (p) can be calculated using the frequency components of the target signal Si and the noise signal Ni in which no spatial aliasing has occurred. As a result, SNRi (p) can be calculated with high accuracy.
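
A minimal sketch of choosing fmax below the spatial aliasing limit follows. It assumes the standard half-wavelength condition fnyq = c / (2d); the function names, the default speed of sound, and the safety `margin` are hypothetical and do not come from the patent.

```python
def spatial_aliasing_frequency(mic_spacing_m, speed_of_sound=340.0):
    """Lowest frequency at which spatial aliasing occurs, assuming c / (2d)."""
    if mic_spacing_m <= 0:
        raise ValueError("microphone spacing must be positive")
    return speed_of_sound / (2.0 * mic_spacing_m)


def choose_fmax(mic_spacing_m, requested_fmax=20000.0, margin=0.95,
                speed_of_sound=340.0):
    """Upper band limit kept strictly below fnyq (fmax < fnyq).

    margin < 1 leaves headroom below fnyq; the value 0.95 is illustrative.
    """
    fnyq = spatial_aliasing_frequency(mic_spacing_m, speed_of_sound)
    return min(requested_fmax, margin * fnyq)
```

With a wide microphone spacing the aliasing limit, not the audible range, determines fmax; with a narrow spacing the requested limit is used unchanged.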

特徴量計算部１５は、計算した各対象方向ＤｉのＳＮＲｉ（ｐ）を特徴量記憶部１６に格納する。特徴量記憶部１６は、格納されたＳＮＲｉ（ｐ）を、対象方向Ｄｉ毎に記憶する。なお、図５の例では、特徴量記憶部１６は、ｊフレーム分のＳＮＲｉ（特徴量Ｃｉ）を記憶する（ｊ＞１）。 The feature quantity calculation unit 15 stores the calculated SNRi (p) of each target direction Di in the feature quantity storage unit 16. The feature amount storage unit 16 stores the stored SNRi (p) for each target direction Di. In the example of FIG. 5, the feature quantity storage unit 16 stores SNRi (feature quantity Ci) for j frames (j> 1).

次に、方向選択部１７が、特徴量記憶部１６から各対象方向Ｄｉの、現フレーム（ｐ）のＳＮＲｉ（ｐ）及び前フレーム（ｐ−１）のＳＮＲｉ（ｐ−１）を読み出し、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が、予め設定された閾値ＴＨ１以上であるか判定する（ステップＳＴ１０５）。すなわち、方向選択部１７は、以下の式が満たされるか判定する。 Next, for each target direction Di, the direction selection unit 17 reads SNRi (p) of the current frame (p) and SNRi (p−1) of the previous frame (p−1) from the feature amount storage unit 16, and determines whether the amount of change in SNRi (p) with respect to SNRi (p−1) is greater than or equal to a preset threshold value TH1 (step ST105). That is, the direction selection unit 17 determines whether the following expression is satisfied.
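
The expression for this determination, Equation (3), is not reproduced in this text. From the description it can be reconstructed as follows; the absolute value is an assumption, made because the determination covers both increases and decreases in SNRi.

$$\left|\mathrm{SNR}_i(p) - \mathrm{SNR}_i(p-1)\right| \geq TH_1 \tag{3}$$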

この判定は、対象方向Ｄｉの音の状態の変化を検知する処理に相当する。より詳細には、対象方向Ｄｉにおける音源の発生及び消滅を検知する処理に相当する。理由は以下のとおりである。 This determination corresponds to processing for detecting a change in the state of the sound in the target direction Di. More specifically, this corresponds to processing for detecting the generation and disappearance of a sound source in the target direction Di. The reason is as follows.

対象方向Ｄｉに、前フレーム（ｐ−１）において音源が存在しなかった場合、ＳＮＲｉ（ｐ−１）は小さな値となる。この対象方向Ｄｉに、現フレーム（ｐ）において音源が発生すると、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）に比べて大きな値となる。この結果、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）に対して急峻に変化（増大）する。 If no sound source exists in the target direction Di in the previous frame (p−1), SNRi (p−1) is a small value. When a sound source is generated in the target direction Di in the current frame (p), SNRi (p) has a larger value than SNRi (p−1). As a result, SNRi (p) changes (increases) abruptly with respect to SNRi (p−1).

対象方向Ｄｉに、前フレーム（ｐ−１）において音源が存在した場合、ＳＮＲｉ（ｐ−１）は大きな値となる。この対象方向Ｄｉに、現フレーム（ｐ）において音源が消滅すると、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）に比べて小さな値となる。この結果、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）に対して急峻に変化（減少）する。 When a sound source exists in the target direction Di in the previous frame (p−1), SNRi (p−1) has a large value. When the sound source in this target direction Di disappears in the current frame (p), SNRi (p) becomes smaller than SNRi (p−1). As a result, SNRi (p) changes (decreases) sharply with respect to SNRi (p−1).

対象方向Ｄｉに、前フレーム（ｐ−１）において音源が存在しなかった場合、ＳＮＲｉ（ｐ−１）は小さな値となる。この対象方向Ｄｉに、現フレーム（ｐ）においても音源が存在しなかった場合、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）とほぼ変わらない値となる。この結果、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量は、小さくなる。 If no sound source exists in the target direction Di in the previous frame (p−1), SNRi (p−1) is a small value. If no sound source exists in the target direction Di even in the current frame (p), SNRi (p) is a value that is substantially the same as SNRi (p−1). As a result, the amount of change in SNRi (p) with respect to SNRi (p−1) becomes small.

対象方向Ｄｉに、前フレーム（ｐ−１）において音源が存在した場合、ＳＮＲｉ（ｐ−１）は大きな値となる。この対象方向Ｄｉに、現フレーム（ｐ）においても音源が存在する場合、ＳＮＲｉ（ｐ）は、ＳＮＲｉ（ｐ−１）とほぼ変わらない値となる。この結果、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量は、小さくなる。 When a sound source exists in the target direction Di in the previous frame (p−1), SNRi (p−1) has a large value. When a sound source is present in the target direction Di even in the current frame (p), SNRi (p) is a value that is substantially the same as SNRi (p−1). As a result, the amount of change in SNRi (p) with respect to SNRi (p−1) becomes small.

したがって、閾値ＴＨ１を適切に設定することにより、ＳＮＲｉ（ｐ）がＳＮＲｉ（ｐ−１）に対して急峻に変化した場合、すなわち、対象方向Ｄｉに音源が発生した場合及び対象方向Ｄｉから音源が消滅した場合、を検知することができる。閾値ＴＨ１の適切な値は、実験により決定すればよい。 Therefore, by setting the threshold value TH1 appropriately, it is possible to detect when SNRi (p) changes sharply with respect to SNRi (p−1), that is, when a sound source appears in the target direction Di and when a sound source disappears from the target direction Di. An appropriate value of the threshold value TH1 may be determined by experiment.
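
The four cases above can be checked with a short sketch. It assumes that the amount of change is the absolute difference between consecutive SNR values; the function name and the numeric values (in dB) are hypothetical.

```python
def sound_state_changed(snr_current, snr_previous, th1):
    """Step ST105 (sketch): True when the SNR of one target direction
    changes by at least TH1 between the previous and current frames."""
    return abs(snr_current - snr_previous) >= th1


TH1 = 6.0  # illustrative threshold; the description says to tune it by experiment

cases = {
    "source appears":    sound_state_changed(15.0, 2.0, TH1),
    "source disappears": sound_state_changed(2.0, 15.0, TH1),
    "still no source":   sound_state_changed(2.5, 2.0, TH1),
    "source continues":  sound_state_changed(14.0, 15.0, TH1),
}
```

Only the first two cases exceed the threshold, which matches the appearance and disappearance events the determination is meant to detect.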

次に、方向選択部１７は、ステップＳＴ１０５の判定結果、すなわち、対象方向Ｄｉの音の状態変化の検知結果に基づいて、ＳＮＲｉ（ｐ）を補正する。補正されたＳＮＲｉ（ｐ）を、Ｃ＿ＳＮＲｉ（ｐ）とする。 Next, the direction selection unit 17 corrects SNRi (p) based on the determination result of step ST105, that is, the detection result of a change in the state of the sound in the target direction Di. The corrected SNRi (p) is denoted C_SNRi (p).

ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が閾値ＴＨ１未満である場合（ステップＳＴ１０５のＮＯ）、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）をｊフレーム分のＳＮＲｉの平均値に設定する（ステップＳＴ１０６）。ｊフレーム分のＳＮＲｉの平均値を、ｍｅａｎ＿ＳＮＲｉ（ｐ）とする。方向選択部１７は、特徴量記憶部１６からｊフレーム分のＳＮＲｉを読み出し、以下の式によりｍｅａｎ＿ＳＮＲｉ（ｐ）を計算する。 When the amount of change in SNRi (p) with respect to SNRi (p−1) is less than the threshold value TH1 (NO in step ST105), the direction selection unit 17 sets C_SNRi (p) to the average value of SNRi over j frames (step ST106). The average value of SNRi over j frames is denoted mean_SNRi (p). The direction selection unit 17 reads SNRi for j frames from the feature amount storage unit 16 and calculates mean_SNRi (p) by the following equation.
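
Equation (4) is not reproduced in this text. A reconstruction consistent with the description is given below; it assumes that the averaging window consists of the current frame and the j−1 frames preceding it.

$$\mathrm{mean\_SNR}_i(p) = \frac{1}{j}\sum_{k=0}^{j-1}\mathrm{SNR}_i(p-k) \tag{4}$$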

方向選択部１７は、計算したｍｅａｎ＿ＳＮＲｉ（ｐ）を、Ｃ＿ＳＮＲｉ（ｐ）として特徴量記憶部１６に格納する。特徴量記憶部１６は、方向選択部１７に格納されたｍｅａｎ＿ＳＮＲｉ（ｐ）を、Ｃ＿ＳＮＲｉ（ｐ）として記憶する（ステップＳＴ１０７）。以降、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）に基づいて、出力方向Ｄｏｕｔを選択する。 The direction selection unit 17 stores the calculated mean_SNRi (p) in the feature amount storage unit 16 as C_SNRi (p). The feature amount storage unit 16 stores mean_SNRi (p) stored in the direction selection unit 17 as C_SNRi (p) (step ST107). Thereafter, the direction selection unit 17 selects the output direction Dout based on C_SNRi (p).

このように、平均化されたＳＮＲｉに基づいて出力方向Ｄｏｕｔを選択することにより、瞬間的なＳＮＲｉの揺らぎに起因する出力方向Ｄｏｕｔの誤選択を、抑制することができる。例えば、発話者のいいよどみや無声音の発話による、ＳＮＲｉ（ｐ）の瞬間的な低下に起因する誤選択を抑制することができる。ここでいう誤選択とは、発話者のいる方向が、出力方向Ｄｏｕｔとして選択されないことをいう。 In this way, by selecting the output direction Dout based on the averaged SNRi, erroneous selection of the output direction Dout caused by momentary fluctuations in SNRi can be suppressed. For example, it is possible to suppress erroneous selection caused by a momentary drop in SNRi (p) when the speaker hesitates or utters an unvoiced sound. Erroneous selection here means that the direction in which the speaker is located is not selected as the output direction Dout.

一方、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が閾値ＴＨ１以上である場合（ステップＳＴ１０５のＹＥＳ）、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）を、ＳＮＲｉ（ｐ）に設定する（ステップＳＴ１０８）。具体的には、方向選択部１７は、ＳＮＲｉ（ｐ）を、Ｃ＿ＳＮＲｉ（ｐ）として特徴量記憶部１６に格納する。特徴量記憶部１６は、方向選択部１７に格納されたＳＮＲｉ（ｐ）を、Ｃ＿ＳＮＲｉ（ｐ）として記憶する（ステップＳＴ１０７）。 On the other hand, when the amount of change in SNRi (p) with respect to SNRi (p−1) is greater than or equal to the threshold value TH1 (YES in step ST105), the direction selection unit 17 sets C_SNRi (p) to SNRi (p) (step ST108). Specifically, the direction selection unit 17 stores SNRi (p) in the feature amount storage unit 16 as C_SNRi (p). The feature amount storage unit 16 stores the SNRi (p) stored by the direction selection unit 17 as C_SNRi (p) (step ST107).
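
The correction in steps ST105 to ST108 can be sketched for a single target direction as follows. The sketch assumes the change is measured as an absolute difference and that the average runs over the stored j frames including the current one; the function name and the handling of a one-element history are hypothetical.

```python
def corrected_snr(snr_history, th1):
    """Return C_SNRi(p) for one target direction.

    snr_history holds the last j values of SNRi, oldest first, so
    snr_history[-1] is SNRi(p) and snr_history[-2] is SNRi(p-1).
    """
    current = snr_history[-1]
    if len(snr_history) < 2:
        return current  # no previous frame yet (assumed behaviour)
    previous = snr_history[-2]
    if abs(current - previous) >= th1:
        return current  # ST108: abrupt change, use the raw value
    return sum(snr_history) / len(snr_history)  # ST106: smooth over j frames
```

For example, with a history of 14, 15, 10 dB and TH1 = 6 dB, the momentary dip to 10 dB is replaced by the three-frame average of 13 dB, whereas a jump from 2 dB to 14 dB passes through unchanged so that the onset is not delayed.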

方向選択部１７は、全ての対象方向Ｄｉについて、Ｃ＿ＳＮＲｉ（ｐ）を設定する。続いて、方向選択部１７は、特徴量記憶部１６から各対象方向ＤｉのＣ＿ＳＮＲｉ（ｐ）を読み出し、Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が、予め設定された閾値ＴＨ２以上であるか判定する（ステップＳＴ１０９）。すなわち、方向選択部１７は、以下の式が満たされるか判定する。 The direction selection unit 17 sets C_SNRi (p) for all target directions Di. Subsequently, the direction selection unit 17 reads C_SNRi (p) of each target direction Di from the feature amount storage unit 16, and determines whether the difference between the maximum value and the minimum value of C_SNRi (p) is greater than or equal to a preset threshold value TH2 (step ST109). That is, the direction selection unit 17 determines whether the following expression is satisfied.
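
Equation (5) is not reproduced in this text. Based on the description of step ST109, it can be reconstructed as:

$$\max_i\left(\mathrm{C\_SNR}_i(p)\right) - \min_i\left(\mathrm{C\_SNR}_i(p)\right) \geq TH_2 \tag{5}$$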

式（５）における、ｍａｘ（Ｃ＿ＳＮＲｉ（ｐ））は、Ｃ＿ＳＮＲｉ（ｐ）の最大値である。ｍｉｎ（Ｃ＿ＳＮＲｉ（ｐ））は、Ｃ＿ＳＮＲｉ（ｐ）の最小値である。 In formula (5), max (C_SNRi (p)) is the maximum value of C_SNRi (p). min (C_SNRi (p)) is the minimum value of C_SNRi (p).

この判定は、音源が存在すると判定するための処理に相当する。これは、音源が存在する場合、音源が存在する対象方向ＤｉのＳＮＲｉ（ｐ）と、音源が存在しない対象方向ＤｉのＳＮＲｉ（ｐ）と、の差が大きくなるためである。閾値ＴＨ２を適切に設定することにより、音源が存在すると判定することができる。閾値ＴＨ２の適切な値は、実験により決定すればよい。 This determination corresponds to a process for determining that a sound source exists. This is because when there is a sound source, the difference between SNRi (p) in the target direction Di where the sound source exists and SNRi (p) in the target direction Di where there is no sound source is large. By appropriately setting the threshold TH2, it can be determined that a sound source exists. An appropriate value of the threshold value TH2 may be determined by experiment.

Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が閾値ＴＨ２以上である場合（ステップＳＴ１０９のＹＥＳ）、方向選択部１７は、音源が存在すると判定する。そして、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）が最大の対象方向Ｄｉを、出力方向Ｄｏｕｔとして選択する（ステップＳＴ１１０）。この出力方向Ｄｏｕｔは、存在すると判定された音源のうち、音源から到来する音が最大の対象方向Ｄｉの方向に相当する。 When the difference between the maximum value and the minimum value of C_SNRi (p) is greater than or equal to threshold value TH2 (YES in step ST109), direction selection unit 17 determines that a sound source exists. Then, the direction selection unit 17 selects the target direction Di having the maximum C_SNRi (p) as the output direction Dout (step ST110). The output direction Dout corresponds to the direction of the target direction Di where the sound coming from the sound source is the largest among the sound sources determined to be present.

Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が閾値ＴＨ２未満である場合（ステップＳＴ１０９のＮＯ）、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が、予め設定された閾値ＴＨ３以上であるか判定する（ステップＳＴ１１１）。閾値ＴＨ３は、閾値ＴＨ２より小さく設定される（ＴＨ２＞ＴＨ３）。すなわち、方向選択部１７は、以下の式が満たされるか判定する。 When the difference between the maximum value and the minimum value of C_SNRi (p) is less than the threshold value TH2 (NO in step ST109), the direction selection unit 17 determines whether the difference between the maximum value and the minimum value of C_SNRi (p) is greater than or equal to a preset threshold value TH3 (step ST111). The threshold value TH3 is set smaller than the threshold value TH2 (TH2 > TH3). That is, the direction selection unit 17 determines whether the following expression is satisfied.
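
The expression for step ST111 is not reproduced in this text. By analogy with the determination in step ST109, it can be reconstructed as follows; the equation number (6) is assumed from the sequence.

$$\max_i\left(\mathrm{C\_SNR}_i(p)\right) - \min_i\left(\mathrm{C\_SNR}_i(p)\right) \geq TH_3 \tag{6}$$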

この判定は、音源が存在しないと判定するための処理に相当する。これは、音源が存在しない場合、各対象方向ＤｉのＳＮＲｉ（ｐ）の差が小さくなるためである。閾値ＴＨ３を適切に設定することにより、音源が存在しないと判定することができる。閾値ＴＨ３の適切な値は、実験により決定すればよい。 This determination corresponds to a process for determining that there is no sound source. This is because when there is no sound source, the difference in SNRi (p) in each target direction Di is small. By appropriately setting the threshold TH3, it can be determined that there is no sound source. An appropriate value for the threshold TH3 may be determined by experiment.

Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が閾値ＴＨ３未満である場合（ステップＳＴ１１１のＮＯ）、方向選択部１７は、音源が存在しないと判定する。そして、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）が最小の対象方向Ｄｉを、出力方向Ｄｏｕｔとして選択する（ステップＳＴ１１２）。この出力方向Ｄｏｕｔは、到来する雑音が最も小さい対象方向Ｄｉに相当する。 When the difference between the maximum value and the minimum value of C_SNRi (p) is less than the threshold value TH3 (NO in step ST111), the direction selection unit 17 determines that there is no sound source. Then, the direction selection unit 17 selects the target direction Di having the smallest C_SNRi (p) as the output direction Dout (step ST112). This output direction Dout corresponds to the target direction Di with the smallest incoming noise.

これは、音源が存在しない場合、各対象方向Ｄｉから到来している音は、いずれも雑音であり、Ｃ＿ＳＮＲｉ（ｐ）の大きさは、雑音の大きさに相当する、と考えられるためである。 This is because, when no sound source exists, the sounds arriving from the target directions Di are all considered to be noise, and the magnitude of C_SNRi (p) is considered to correspond to the magnitude of that noise.

一方、Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が閾値ＴＨ３以上である場合（ステップＳＴ１１１のＹＥＳ）、方向選択部１７は、音源の有無が不明であると判定する。そして、方向選択部１７は、前フレーム（ｐ−１）において出力方向Ｄｏｕｔとして選択した対象方向Ｄｉを、現フレーム（ｐ）の出力方向Ｄｏｕｔとして選択する（ステップＳＴ１１３）。 On the other hand, when the difference between the maximum value and the minimum value of C_SNRi (p) is greater than or equal to the threshold value TH3 (YES in step ST111), the direction selection unit 17 determines that the presence or absence of a sound source is unknown. The direction selection unit 17 then selects the target direction Di that was selected as the output direction Dout in the previous frame (p−1) as the output direction Dout of the current frame (p) (step ST113).
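
The three-way decision in steps ST109 to ST113 can be sketched as follows. Directions are represented by 0-based list indices, and the function name and argument layout are hypothetical; the threshold values themselves are left to the caller, since the description says they are determined by experiment.

```python
def select_output_direction(c_snr, th2, th3, previous_dout):
    """Select the output direction index from corrected SNR values.

    c_snr[i] is C_SNRi(p) for target direction i; previous_dout is the
    index selected in the previous frame.
    """
    if th3 >= th2:
        raise ValueError("TH3 must be smaller than TH2")
    spread = max(c_snr) - min(c_snr)
    if spread >= th2:
        # ST110: a sound source is judged present; pick the strongest direction
        return c_snr.index(max(c_snr))
    if spread < th3:
        # ST112: no sound source; pick the direction with the least noise
        return c_snr.index(min(c_snr))
    # ST113: presence unknown; keep the previous output direction
    return previous_dout
```

Keeping the previous direction in the intermediate band acts as hysteresis, so the output does not flip between directions when the spread hovers near a single threshold.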

方向選択部１７は、出力方向Ｄｏｕｔを選択すると、選択した出力方向Ｄｏｕｔを出力部１８に通知する。出力方向Ｄｏｕｔを通知された出力部１８は、候補信号記憶部１４から、出力方向Ｄｏｕｔの候補信号（目標信号Ｓｏｕｔ）を読み出し、読み出した候補信号を出力信号として出力する（ステップＳＴ１１４）。 When selecting the output direction Dout, the direction selection unit 17 notifies the output unit 18 of the selected output direction Dout. The output unit 18 notified of the output direction Dout reads the candidate signal (target signal Sout) in the output direction Dout from the candidate signal storage unit 14, and outputs the read candidate signal as an output signal (step ST114).

図６は、音響信号処理装置１の動作の具体例を説明する図である。図６の例では、８個の対象方向Ｄｉ（ｉ＝１〜８）が等間隔で設定されている。以下では、現フレーム（ｐ）において、対象方向Ｄ２に音源ＳＳが発生した場合を例に説明する。前フレーム（ｐ−１）には、音源ＳＳは存在しなかったものとする。 FIG. 6 is a diagram for explaining a specific example of the operation of the acoustic signal processing device 1. In the example of FIG. 6, eight target directions Di (i = 1 to 8) are set at equal intervals. Hereinafter, a case where the sound source SS is generated in the target direction D2 in the current frame (p) will be described as an example. It is assumed that no sound source SS exists in the previous frame (p-1).

現フレーム（ｐ）の開始時刻が到来すると、音響信号記憶部１２が、集音部１１が出力している音響信号をチャネル毎に記憶する（ステップＳＴ１０１）。各チャネルの音響信号には、対象方向Ｄ２の音源ＳＳから到来した音に対応する音響信号が含まれる。 When the start time of the current frame (p) comes, the acoustic signal storage unit 12 stores the acoustic signal output from the sound collection unit 11 for each channel (step ST101). The acoustic signal of each channel includes an acoustic signal corresponding to the sound arriving from the sound source SS in the target direction D2.

次に、ビームフォーマ部１３が、音響信号記憶部１２に記憶された音響信号に基づいて、ビームフォーミング処理を実行し、各対象方向Ｄｉの目標信号Ｓｉ及び雑音信号Ｎｉを生成する（ステップＳＴ１０２）。これにより、目標信号Ｓ１〜Ｓ８と、雑音信号Ｎ１〜Ｎ８が生成される。音源ＳＳが対象方向Ｄ２に存在するため、目標信号Ｓ２が相対的に大きな値となる。目標信号Ｓ１〜Ｓ８は、候補信号記憶部１４に記憶される（ステップＳＴ１０３）。 Next, the beam former unit 13 performs beam forming processing based on the acoustic signals stored in the acoustic signal storage unit 12, and generates the target signal Si and the noise signal Ni for each target direction Di (step ST102). As a result, target signals S1 to S8 and noise signals N1 to N8 are generated. Since the sound source SS exists in the target direction D2, the target signal S2 has a relatively large value. The target signals S1 to S8 are stored in the candidate signal storage unit 14 (step ST103).

続いて、特徴量計算部１５が、目標信号Ｓ１及び雑音信号Ｎ１に基づいて、対象方向Ｄ１のＳＮＲ１（ｐ）を計算する（ステップＳＴ１０４）。同様の方法で、特徴量計算部１５は、ＳＮＲ２（ｐ）〜ＳＮＲ８（ｐ）を計算する。目標信号Ｓ２が相対的に大きな値であるため、ＳＮＲ２（ｐ）が相対的に大きな値となる。こうして計算されたＳＮＲｉ（ｐ）が、特徴量記憶部１６に記憶される。 Subsequently, the feature amount calculator 15 calculates SNR1 (p) in the target direction D1 based on the target signal S1 and the noise signal N1 (step ST104). In the same way, the feature amount calculation unit 15 calculates SNR2 (p) to SNR8 (p). Since the target signal S2 has a relatively large value, SNR2 (p) has a relatively large value. The SNRi (p) calculated in this way is stored in the feature amount storage unit 16.

方向選択部１７は、特徴量記憶部１６からＳＮＲｉ（ｐ）を読み出し、各対象方向Ｄｉについて、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が閾値ＴＨ１以上であるか判定する（ステップＳＴ１０５）。 The direction selection unit 17 reads SNRi (p) from the feature amount storage unit 16 and determines whether the change amount of SNRi (p) with respect to SNRi (p−1) is greater than or equal to the threshold TH1 for each target direction Di (step ST105).

対象方向Ｄ１，Ｄ３〜Ｄ８は、前フレーム（ｐ−１）にも現フレーム（ｐ）にも音源ＳＳが存在しないため、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が小さく、閾値ＴＨ１未満となる（ステップＳＴ１０５のＮＯ）。したがって、方向選択部１７は、対象方向Ｄ１，Ｄ３〜Ｄ８のＣ＿ＳＮＲｉ（ｐ）を、ｍｅａｎ＿ＳＮＲｉ（ｐ）に設定する（ステップＳＴ１０６）。 In the target directions D1 and D3 to D8, the sound source SS exists in neither the previous frame (p−1) nor the current frame (p), so the amount of change in SNRi (p) with respect to SNRi (p−1) is small and falls below the threshold value TH1 (NO in step ST105). Therefore, the direction selection unit 17 sets C_SNRi (p) of the target directions D1 and D3 to D8 to mean_SNRi (p) (step ST106).

一方、対象方向Ｄ２は、現フレーム（ｐ）に音源ＳＳが発生したことにより、ＳＮＲｉ（ｐ−１）に対してＳＮＲｉ（ｐ）が急峻に増大しているため、ＳＮＲｉ（ｐ−１）に対するＳＮＲｉ（ｐ）の変化量が、閾値ＴＨ１以上となる（ステップＳＴ１０５のＹＥＳ）。したがって、方向選択部１７は、対象方向Ｄ２のＣ＿ＳＮＲ２（ｐ）を、ＳＮＲ２（ｐ）に設定する（ステップＳＴ１０８）。 On the other hand, in the target direction D2, since the SNRi (p) sharply increases with respect to the SNRi (p-1) due to the occurrence of the sound source SS in the current frame (p), the SNRi (p-1) The amount of change in SNRi (p) is greater than or equal to threshold value TH1 (YES in step ST105). Therefore, the direction selection unit 17 sets C_SNR2 (p) in the target direction D2 to SNR2 (p) (step ST108).

こうして設定された各対象方向ＤｉのＣ＿ＳＮＲｉ（ｐ）が、特徴量記憶部１６に記憶される（ステップＳＴ１０７）。 The C_SNRi (p) of each target direction Di set in this way is stored in the feature amount storage unit 16 (step ST107).

次に、方向選択部１７は、Ｃ＿ＳＮＲｉ（ｐ）の最大値と最小値との差が、閾値ＴＨ２以上であるか判定する（ステップＳＴ１０９）。Ｃ＿ＳＮＲｉ（ｐ）の最大値は、Ｃ＿ＳＮＲ２（ｐ）である。Ｃ＿ＳＮＲｉ（ｐ）の最小値は、Ｃ＿ＳＮＲ１（ｐ），Ｃ＿ＳＮＲ３（ｐ）〜Ｃ＿ＳＮＲ８（ｐ）のいずれかである。ここでは、最小値は、Ｃ＿ＳＮＲ１（ｐ）であるものとする。 Next, the direction selection unit 17 determines whether the difference between the maximum value and the minimum value of C_SNRi (p) is greater than or equal to the threshold value TH2 (step ST109). The maximum value of C_SNRi (p) is C_SNR2 (p). The minimum value of C_SNRi (p) is one of C_SNR1 (p) and C_SNR3 (p) to C_SNR8 (p). Here, it is assumed that the minimum value is C_SNR1 (p).

Ｃ＿ＳＮＲ２（ｐ）（＝ＳＮＲ２（ｐ））とＣ＿ＳＮＲ１（ｐ）との差は、閾値ＴＨ２以上であるため（ステップＳＴ１０９のＹＥＳ）、方向選択部１７は、対象方向Ｄ２を出力方向Ｄｏｕｔとして選択する（ステップＳＴ１１０）。 Since the difference between C_SNR2 (p) (= SNR2 (p)) and C_SNR1 (p) is greater than or equal to the threshold value TH2 (YES in step ST109), the direction selection unit 17 selects the target direction D2 as the output direction Dout. (Step ST110).

出力方向Ｄｏｕｔが対象方向Ｄ２であることを通知された出力部１８は、候補信号記憶部１４から対象方向Ｄ２の目標信号Ｓ２を読み出し、読み出した目標信号Ｓ２を出力信号として出力する（ステップＳＴ１１４）。 Having been notified that the output direction Dout is the target direction D2, the output unit 18 reads the target signal S2 of the target direction D2 from the candidate signal storage unit 14, and outputs the read target signal S2 as the output signal (step ST114).
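
The walkthrough of FIG. 6 can be reproduced numerically with a self-contained sketch of one frame of processing. All SNR values and thresholds below are invented for illustration, the change in SNR is assumed to be an absolute difference, and directions D1 to D8 are mapped to list indices 0 to 7.

```python
def process_frame(snr_histories, th1, th2, th3, previous_dout):
    """Run the decisions of steps ST105 to ST113 for one frame.

    snr_histories[i] holds the last j SNR values of direction i, oldest
    first, so [-1] is SNRi(p) and [-2] is SNRi(p-1).
    Returns (output direction index, list of corrected SNR values).
    """
    c_snr = []
    for history in snr_histories:
        current, previous = history[-1], history[-2]
        if abs(current - previous) >= th1:
            c_snr.append(current)                      # ST108: raw value
        else:
            c_snr.append(sum(history) / len(history))  # ST106: j-frame mean
    spread = max(c_snr) - min(c_snr)
    if spread >= th2:
        return c_snr.index(max(c_snr)), c_snr          # ST110
    if spread < th3:
        return c_snr.index(min(c_snr)), c_snr          # ST112
    return previous_dout, c_snr                        # ST113


# Hypothetical three-frame histories (dB): a source appears in D2 only.
histories = [[1.0, 1.0, 1.0], [2.0, 2.0, 16.0]] + [[2.0, 2.0, 2.0] for _ in range(6)]
dout, c_snr = process_frame(histories, th1=6.0, th2=8.0, th3=3.0, previous_dout=0)
```

With these values, D2 keeps its raw SNR because of the abrupt rise, the other directions are averaged, and the spread between D2 and D1 exceeds TH2, so index 1 (direction D2) is selected.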

以上説明したとおり、本実施形態に係る音響信号処理装置１は、音源が存在しない場合、雑音が最も小さい対象方向Ｄｉを出力方向Ｄｏｕｔとして選択し、出力方向Ｄｏｕｔからの音に応じた出力信号を出力する。これにより、雑音が最も小さい対象方向Ｄｉから到来した音が、通信相手に送信される。結果として、雑音の送信を抑制することができる。 As described above, when no sound source exists, the acoustic signal processing device 1 according to the present embodiment selects the target direction Di with the smallest noise as the output direction Dout, and outputs an output signal corresponding to the sound from the output direction Dout. As a result, the sound arriving from the target direction Di with the smallest noise is transmitted to the communication partner. Consequently, the transmission of noise can be suppressed.

また、本実施形態に係る音響信号処理装置１は、音源が存在する場合、音源が存在する対象方向Ｄｉを出力方向Ｄｏｕｔとして選択し、出力方向Ｄｏｕｔからの音に応じた出力信号を出力する。これにより、音源が存在する対象方向Ｄｉから到来した音、すなわち、音源から到来した音が、通信相手に送信される。結果として、音源から到来した音を、精度よく集音し、通信相手に送信することができる。 Moreover, when the sound source exists, the acoustic signal processing device 1 according to the present embodiment selects the target direction Di where the sound source exists as the output direction Dout, and outputs an output signal corresponding to the sound from the output direction Dout. Thereby, the sound that has arrived from the target direction Di in which the sound source exists, that is, the sound that has arrived from the sound source is transmitted to the communication partner. As a result, sound arriving from the sound source can be collected with high accuracy and transmitted to the communication partner.

この音響信号処理装置１を音声会議システム等に適用した場合、音響信号処理装置１は、発話者（音源）が存在しない場合、雑音が最も小さい対象方向Ｄｉから到来した音を通信相手に送信する。また、音響信号処理装置１は、発話者（音源）が存在する場合、発話者が発話した音声を通信相手に送信する。すなわち、音響信号処理装置１は、発話者が発話した音声を精度よく収集しつつ、通信相手に送信される雑音を抑制することができる。結果として、本実施形態に係る音響信号処理装置１は、快適な会議環境を実現することができる。 When this acoustic signal processing device 1 is applied to an audio conference system or the like, the acoustic signal processing device 1 transmits the sound arriving from the target direction Di with the smallest noise to the communication partner when no speaker (sound source) exists. When a speaker (sound source) exists, the acoustic signal processing device 1 transmits the voice uttered by the speaker to the communication partner. That is, the acoustic signal processing device 1 can suppress the noise transmitted to the communication partner while accurately collecting the voice uttered by the speaker. As a result, the acoustic signal processing device 1 according to the present embodiment can realize a comfortable conference environment.

また、本実施形態に係る音響信号処理装置１は、無指向性マイクロホンにより集音する。したがって、音響信号処理装置１を適用することにより、指向性マイクロホンにより集音する従来の音声会議システム等に比べて、音声会議システム等を安価に構成することができる。 Moreover, the acoustic signal processing device 1 according to the present embodiment collects sound with an omnidirectional microphone. Therefore, by applying the acoustic signal processing device 1, it is possible to configure an audio conference system or the like at a lower cost than a conventional audio conference system or the like that collects sound with a directional microphone.

また、本実施形態に係る音響信号処理装置１は、過去（前フレーム）の特徴量Ｃｉに対する現在（現フレーム）の特徴量Ｃｉの変化量に基づいて、対象方向Ｄｉの音源の有無を判定する。したがって、音響信号処理装置１は、音源発生時の音を、取りこぼしなく集音できる。これは、音源発生時には、音響信号が急峻に立ち上がり、特徴量Ｃｉが大きく変化するためである。 Further, the acoustic signal processing device 1 according to the present embodiment determines the presence or absence of a sound source in the target direction Di based on the amount of change in the current (current frame) feature amount Ci with respect to the past (previous frame) feature amount Ci. Therefore, the acoustic signal processing device 1 can collect the sound at the onset of a sound source without missing it. This is because, when a sound source appears, the acoustic signal rises steeply and the feature amount Ci changes greatly.

また、本実施形態に係る音響信号処理装置１は、過去一定期間（ｊフレーム分の期間）の特徴量Ｃｉを考慮して、現在の特徴量Ｃｉを補正し、補正された特徴量Ｃｉに基づいて、音源の有無を判定する。したがって、瞬間的な音響信号（特徴量Ｃｉ）の揺らぎに起因する出力方向Ｄｏｕｔの誤選択を抑制することができる。 In addition, the acoustic signal processing device 1 according to the present embodiment corrects the current feature amount Ci in consideration of the feature amounts Ci over a fixed past period (a period of j frames), and determines the presence or absence of a sound source based on the corrected feature amount Ci. Therefore, it is possible to suppress erroneous selection of the output direction Dout caused by momentary fluctuations of the acoustic signal (feature amount Ci).

なお、本実施形態において、ステップＳＴ１１１，ＳＴ１１３は省略されてもよい。これにより、出力方向Ｄｏｕｔの選択処理の計算量を減らし、選択処理を高速化することができる。ステップＳＴ１１１，ＳＴ１１３を省略する場合、ステップＳＴ１０９の判定により、音源の方向の変化の有無を判定すればよい。すなわち、ステップＳＴ１０９のＮＯの場合、ステップＳＴ１１２の処理を実行すればよい。 In this embodiment, steps ST111 and ST113 may be omitted. Thereby, the calculation amount of the selection process of the output direction Dout can be reduced, and the selection process can be speeded up. When steps ST111 and ST113 are omitted, the presence or absence of a change in the direction of the sound source may be determined based on the determination in step ST109. That is, in the case of NO in step ST109, the process in step ST112 may be executed.

＜第２実施形態＞ <Second Embodiment>
第２実施形態に係る音響信号処理装置１について、図７及び図８を参照して説明する。本実施形態では、高い指向性を有する音響信号処理装置１について説明する。 The acoustic signal processing device 1 according to the second embodiment will be described with reference to FIGS. 7 and 8. In the present embodiment, an acoustic signal processing device 1 having high directivity will be described.

（音響信号処理装置１の構成） (Configuration of acoustic signal processing apparatus 1)
図７は、本実施形態に係る音響信号処理装置１の機能構成の一例を示す図である。図７の音響信号処理装置１は、候補信号生成部１９を更に備える。他の機能構成及びハードウェア構成は、第１実施形態と同様である。 FIG. 7 is a diagram illustrating an example of a functional configuration of the acoustic signal processing device 1 according to the present embodiment. The acoustic signal processing device 1 of FIG. 7 further includes a candidate signal generation unit 19. Other functional configurations and hardware configurations are the same as those in the first embodiment.

候補信号生成部１９は、ビームフォーマ部１３が生成した各対象方向Ｄｉの目標信号Ｓｉ及び雑音信号Ｎｉに基づいて、各対象方向Ｄｉの候補信号Ｏｉを生成し、生成した候補信号Ｏｉを候補信号記憶部１４に格納する。本実施形態では、候補信号記憶部１４は、候補信号生成部１９が生成した各対象方向Ｄｉの候補信号Ｏｉを記憶する。 The candidate signal generation unit 19 generates a candidate signal Oi in each target direction Di based on the target signal Si and noise signal Ni in each target direction Di generated by the beamformer unit 13, and uses the generated candidate signal Oi as a candidate signal. Store in the storage unit 14. In the present embodiment, the candidate signal storage unit 14 stores the candidate signal Oi for each target direction Di generated by the candidate signal generation unit 19.

候補信号生成部１９は、目標信号Ｓｉから雑音信号Ｎｉ（雑音成分）を除去することにより、候補信号Ｏｉを生成する。これにより、目標信号ＳｉよりＳＮＲｉが改善された候補信号Ｏｉが生成される。 The candidate signal generation unit 19 generates the candidate signal Oi by removing the noise signal Ni (noise component) from the target signal Si. As a result, a candidate signal Oi having SNRi improved from the target signal Si is generated.

目標信号Ｓｉから雑音信号Ｎｉを除去する方法は、例えば、ＭＭＳＥ−ＳＴＳＡ法（Minimum Mean-Square-Error Short-Time Spectral Amplitude estimator）であるが、これに限られない。なお、候補信号生成部１９は、プロセッサ２０１がメモリ２０２に格納されたプログラムを実行することにより実現される。 The method of removing the noise signal Ni from the target signal Si is, for example, the MMSE-STSA method (Minimum Mean-Square-Error Short-Time Spectral Amplitude estimator), but is not limited thereto. The candidate signal generation unit 19 is realized by the processor 201 executing a program stored in the memory 202.
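
Since the removal method is not limited to MMSE-STSA, the idea of generating the candidate signal Oi can be illustrated with a much simpler technique. The sketch below uses plain magnitude spectral subtraction with a spectral floor, which is a substitute chosen for brevity and is not the MMSE-STSA estimator; the function name and the `floor` parameter are hypothetical.

```python
def spectral_subtraction(target_mag, noise_mag, floor=0.05):
    """Per-bin magnitude spectral subtraction (not MMSE-STSA).

    target_mag[k] and noise_mag[k] are magnitude spectra of the target
    signal Si and the noise signal Ni for one frame. Each output bin is
    the target magnitude minus the noise magnitude, limited from below
    by floor * target magnitude so that no bin becomes negative.
    """
    if len(target_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    return [max(s - n, floor * s) for s, n in zip(target_mag, noise_mag)]
```

In a complete implementation the resulting magnitudes would be combined with the phase of the target signal and transformed back to the time domain; that step is omitted here.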

（音響信号処理装置１の動作） (Operation of the acoustic signal processing apparatus 1)
図８は、本実施形態に係る音響信号処理装置１の動作の一例を示すフローチャートである。図８のフローチャートは、ステップＳＴ１１５を有する。他のステップは、第１実施形態と同様である。 FIG. 8 is a flowchart showing an example of the operation of the acoustic signal processing apparatus 1 according to the present embodiment. The flowchart in FIG. 8 includes step ST115. Other steps are the same as in the first embodiment.

本実施形態では、ステップＳＴ１０２において、各対象方向Ｄｉの目標信号Ｓｉ及び雑音信号Ｎｉを生成したビームフォーマ部１３は、生成した各対象方向Ｄｉの目標信号Ｓｉ及び雑音信号Ｎｉを、特徴量計算部１５及び候補信号生成部１９に渡す。 In this embodiment, the beam former unit 13, having generated the target signal Si and the noise signal Ni of each target direction Di in step ST102, passes the generated target signal Si and noise signal Ni of each target direction Di to the feature amount calculation unit 15 and the candidate signal generation unit 19.

各対象方向Ｄｉの目標信号Ｓｉ及び雑音信号Ｎｉを受け取ると、候補信号生成部１９は、各対象方向Ｄｉの目標信号Ｓｉから雑音信号Ｎｉを除去することにより、各対象方向Ｄｉの候補信号Ｏｉを生成する（ステップＳＴ１１５）。候補信号Ｏｉの生成方法は上述の通りである。そして、候補信号生成部１９は、生成した候補信号Ｏｉを候補信号記憶部１４に格納する。候補信号記憶部１４は、格納された候補信号Ｏｉを、対象方向Ｄｉ毎に記憶する（ステップＳＴ１０３）。 Upon receiving the target signal Si and the noise signal Ni of each target direction Di, the candidate signal generation unit 19 generates the candidate signal Oi of each target direction Di by removing the noise signal Ni from the target signal Si of that target direction (step ST115). The method for generating the candidate signal Oi is as described above. The candidate signal generation unit 19 then stores the generated candidate signal Oi in the candidate signal storage unit 14. The candidate signal storage unit 14 stores the stored candidate signal Oi for each target direction Di (step ST103).

以降の処理は第１実施形態と同様である。ただし、本実施形態では、ステップＳＴ１１４において、出力部１８は、出力方向Ｄｏｕｔの候補信号Ｏｏｕｔを出力信号として出力する。 The subsequent processing is the same as in the first embodiment. However, in this embodiment, in step ST114, the output unit 18 outputs the candidate signal Oout in the output direction Dout as an output signal.

以上説明したとおり、本実施形態に係る音響信号処理装置１は、出力信号として、目標信号ＳｉよりＳＮＲｉが改善された候補信号Ｏｉを出力することができる。したがって、本実施形態によれば、音響信号処理装置１の指向性をより鋭くすることができる。 As described above, the acoustic signal processing device 1 according to the present embodiment can output the candidate signal Oi whose SNRi is improved from the target signal Si as an output signal. Therefore, according to the present embodiment, the directivity of the acoustic signal processing device 1 can be made sharper.

＜第３実施形態＞ <Third Embodiment>
第３実施形態に係る音響信号処理装置１について、図９を参照して説明する。本実施形態では、雑音信号Ｎｉを使用せずに、特徴量Ｃｉを計算できる音響信号処理装置１について説明する。 The acoustic signal processing apparatus 1 according to the third embodiment will be described with reference to FIG. 9. In the present embodiment, an acoustic signal processing device 1 that can calculate a feature value Ci without using a noise signal Ni will be described.

本実施形態に係る音響信号処理装置１の機能構成は、ビームフォーマ部１３及び特徴量計算部１５を除き、第１実施形態と同様である。また、音響信号処理装置１のハードウェア構成は、第１実施形態と同様である。以下、本実施形態におけるビームフォーマ部１３及び特徴量計算部１５について説明する。 The functional configuration of the acoustic signal processing device 1 according to the present embodiment is the same as that of the first embodiment except for the beam former unit 13 and the feature amount calculation unit 15. The hardware configuration of the acoustic signal processing device 1 is the same as that of the first embodiment. Hereinafter, the beam former unit 13 and the feature amount calculation unit 15 in the present embodiment will be described.

The beamformer unit 13 generates the target signal Si for each target direction Di based on the acoustic signals. Unlike in the first embodiment, however, the beamformer unit 13 does not generate the noise signal Ni.

The feature amount calculation unit 15 calculates the feature amount Ci of a target direction Di (first target direction) based on the target signal Si of that direction and the target signal Sj of another target direction Dj (second target direction). In the present embodiment, the target signal Sj of the other target direction Dj takes the place of the noise signal Ni of the first embodiment, that is, it is used as the noise component of the target signal Si. This is possible because the target signal Sj is an acoustic signal corresponding to sound arriving from a target direction Dj different from the target direction Di.

The target direction Dj can be selected arbitrarily from the n - 1 target directions D other than the target direction Di. However, because the target signal Sj is used as the noise component of the target signal Si, it preferably contains no acoustic signal corresponding to sound arriving from the target direction Di. The target direction Dj is therefore preferably close to the direction opposite to the target direction Di (the direction differing from Di by 180 degrees), and more preferably is exactly the opposite direction. Using such a target signal Sj as the noise component of the target signal Si allows the feature amount calculation unit 15 to calculate the feature amount Ci accurately.

The method of calculating the feature amount Ci is the same as in the first embodiment, with the noise signal Ni of the first embodiment replaced by the target signal Sj. For example, SNRi(p) of the target direction Di is calculated by the following equation.
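One plausible form of this substitution is shown below. It assumes that the first embodiment defines the SNR of frame p as the ratio of the frame powers of the target signal and the noise signal, expressed in decibels; that definition, and the symbol T_p for the set of sample indices in frame p, are assumptions introduced here for illustration.

```latex
\mathrm{SNR}_i(p) = 10 \log_{10} \frac{\sum_{t \in T_p} S_i(t)^2}{\sum_{t \in T_p} S_j(t)^2}
```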

FIG. 9 shows a specific example of how the feature amount Ci is calculated in the present embodiment. In the example of FIG. 9, eight target directions Di (i = 1 to 8) are set at equal intervals. The feature amount C2 of the target direction D2, for example, is calculated from the target signal S2 and the target signal S6, with S6 used as the noise component of S2. Likewise, the feature amount C6 of the target direction D6 is calculated from the target signal S6 and the target signal S2, with S2 used as the noise component of S6.
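The pairing in FIG. 9 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes an even number of equally spaced directions, uses 0-based indices (so D2 is index 1 and D6 is index 5), and assumes the SNR is a frame power ratio in decibels. The function names and the small constant eps are hypothetical.

```python
import math

def frame_power(frame):
    """Mean squared amplitude of one frame (a list of samples)."""
    return sum(x * x for x in frame) / len(frame)

def opposite_index(i, n):
    """0-based index of the direction opposite to i among n equally spaced directions (n even)."""
    return (i + n // 2) % n

def snr_from_opposite(targets, eps=1e-12):
    """SNR in dB for every direction, using the opposite target signal Sj as the noise component."""
    n = len(targets)
    snrs = []
    for i in range(n):
        j = opposite_index(i, n)
        signal_power = frame_power(targets[i])
        noise_power = frame_power(targets[j])
        # eps (an assumed safeguard) avoids division by zero and log of zero
        snrs.append(10.0 * math.log10((signal_power + eps) / (noise_power + eps)))
    return snrs

# Eight directions; only index 1 (D2) carries a strong signal.
targets = [[0.1] * 4 for _ in range(8)]
targets[1] = [1.0] * 4
snrs = snr_from_opposite(targets)
```

With these values the SNR of D2 is high and that of the opposite direction D6 is correspondingly low, while the remaining directions are close to 0 dB.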

As described above, the acoustic signal processing device 1 according to the present embodiment can calculate the feature amount Ci without using the noise signal Ni. The feature amount Ci can therefore be calculated even when the beamformer unit 13 cannot calculate the noise signal Ni (that is, when the subtractor output of the delay-and-sum beamformer is not available).

In the present embodiment, some of the target signals Si may be used only as the other target signal Sj (noise component). For example, in FIG. 9, the target signals S5 to S8 may be used only as the noise components of the target signals S1 to S4, respectively. In this case, the feature amounts C1 to C4 are calculated and the feature amounts C5 to C8 are not, so the output direction Dout is selected from among the target directions D1 to D4.

A single target signal Sj may also be used as the noise component of a plurality of target signals Si. For example, in FIG. 9, the target signal S8 may be used as the noise component of the target signals S1 to S7. In this case, the feature amounts C1 to C7 are calculated based on the target signal S8.

The feature amount Ci may also be calculated using a noise component derived from a plurality of target signals Sj. For example, in FIG. 9, the feature amount C1 may be calculated using, as the noise component, a signal obtained by averaging the target signals S4 to S6.
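The averaging variant can be sketched as below. The sketch assumes a sample-wise average of equal-length frames and, as before, an SNR defined as a frame power ratio in decibels; both are assumptions, and the function names are hypothetical.

```python
import math

def average_signals(signals):
    """Sample-wise average of several equal-length frames."""
    count = len(signals)
    return [sum(samples) / count for samples in zip(*signals)]

def snr_db(signal, noise, eps=1e-12):
    """Power ratio of two frames in decibels (assumed SNR definition)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))

# Feature amount C1 from S1, with the average of S4 to S6 as the noise component.
s1 = [1.0, -1.0, 1.0, -1.0]
s4 = [0.1, -0.1, 0.1, -0.1]
s5 = [0.2, -0.2, 0.2, -0.2]
s6 = [0.0, 0.0, 0.0, 0.0]
noise = average_signals([s4, s5, s6])
c1 = snr_db(s1, noise)
```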

<Fourth Embodiment>
The acoustic signal processing device 1 according to the fourth embodiment will be described with reference to FIGS. 10 to 12. This embodiment describes an acoustic signal processing device 1 that can detect a desired type of sound by using a plurality of feature amounts C.

The functional configuration of the acoustic signal processing device 1 according to the present embodiment is the same as that of the first embodiment except for the feature amount calculation unit 15 and the direction selection unit 17. The hardware configuration of the acoustic signal processing device 1 is also the same as that of the first embodiment. The feature amount calculation unit 15 and the direction selection unit 17 of the present embodiment are described below.

The feature amount calculation unit 15 calculates a plurality of feature amounts for each target direction Di and stores them in the feature amount storage unit 16. In the following, Q feature amounts, from the first to the Q-th, are assumed to be calculated for each direction. The q-th feature amount of the target direction Di is denoted Ciq, and the feature amounts Ciq are collectively referred to as the feature amount Ci. The feature amount Ci is then represented by a vector containing the Q feature amounts Ciq, as follows.
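Written out from the definitions above, the vector takes the following form. Whether the original formula uses a row or a column vector is not recoverable from this text, so the row layout here is an assumption.

```latex
C_i = \left( C_{i1},\ C_{i2},\ \ldots,\ C_{iQ} \right)
```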

In the present embodiment, the feature amount Ciq is not limited to an acoustic feature amount such as the SNR. It may be a higher-order statistic or the score of a classifier. Examples of higher-order statistics are kurtosis and cumulants. Examples of classifiers are a hidden Markov model, a Gaussian mixture model (GMM), and a deep neural network (DNN). The feature amount Ciq is preferably one that correlates strongly with the characteristics of the type of sound to be detected (for example, speech).
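As an illustration of a higher-order statistic, the sketch below computes the excess kurtosis of one frame of a target signal. It is only an example of such a feature: the choice of excess kurtosis (fourth standardized moment minus 3), the function name, and the convention of returning 0.0 for a constant frame are assumptions, not part of the embodiment.

```python
def excess_kurtosis(frame):
    """Excess kurtosis of a frame: fourth central moment / variance^2 - 3."""
    n = len(frame)
    mean = sum(frame) / n
    variance = sum((x - mean) ** 2 for x in frame) / n
    if variance == 0.0:
        # A constant frame has no defined kurtosis; 0.0 is an assumed convention.
        return 0.0
    fourth_moment = sum((x - mean) ** 4 for x in frame) / n
    return fourth_moment / (variance * variance) - 3.0

flat = excess_kurtosis([-1.0, 1.0, -1.0, 1.0])            # two-level signal
spiky = excess_kurtosis([0.0] * 7 + [4.0])                # one isolated peak
```

A frame dominated by a few large peaks yields a positive value, while a two-level signal yields a negative one, which is why such a statistic can help separate sound types.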

The direction selection unit 17 selects the output direction Dout based on the feature amount Ci of each target direction Di. Two methods of selecting the output direction Dout are described below.

(First selection method)
First, the first method of selecting the output direction Dout using the feature amount Ci is described. In the first selection method, an evaluation value Vi of the target direction Di is calculated from the feature amount Ci, and the output direction Dout is selected based on the evaluation value Vi.

The evaluation value Vi is, for example, a linear sum of the Q feature amounts Ciq, although it is not limited to this. When the evaluation value Vi is a linear sum and wq is the weighting coefficient of the feature amount Ciq, the evaluation value Vi is expressed by the following equation.
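From the description above, the linear sum can be written as follows. The equation number follows the later reference to formula (9); the exact typography of the original formula is an assumption.

```latex
V_i = \sum_{q=1}^{Q} w_q \, C_{iq} \tag{9}
```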

The evaluation value Vi is preferably calculated so that it becomes larger (or smaller) as the probability that a source of the sound to be detected is present becomes higher. For example, when speech is to be detected, the evaluation value Vi is calculated so that it becomes larger (or smaller) as the probability that a speaker is present becomes higher. Such an evaluation value Vi can be obtained, for example, by using as a feature amount Ciq the score of a classifier that identifies the presence or absence of a speaker (an acoustic signal corresponding to speech). The weighting coefficients w in equation (9) may be set appropriately by experiment.

FIG. 10 is a flowchart showing an example of the first selection method. The processing of FIG. 10 corresponds to the processing from step ST104 onward in the first embodiment. Accordingly, when the processing of FIG. 10 starts, the feature amount storage unit 16 already stores the feature amount Ci of each target direction Di. The following description takes as an example the case where the acoustic signal processing device 1 detects speech.

First, the direction selection unit 17 reads the feature amount Ci of each target direction Di from the feature amount storage unit 16 and calculates the evaluation value Vi of each target direction Di from the feature amounts read (step ST201). Here, the evaluation value Vi is assumed to become larger as the probability that a speaker is present becomes higher.

Next, the direction selection unit 17 determines whether the difference between the maximum value (max(Vi)) and the minimum value (min(Vi)) of the evaluation values Vi is equal to or greater than a preset threshold TH4 (step ST202).

This determination corresponds to processing for determining that a speaker is present. When a speaker is present, the difference between the evaluation value Vi of the target direction Di in which the speaker is present and the evaluation value Vi of a target direction Di in which no speaker is present becomes large. Setting the threshold TH4 appropriately therefore makes it possible to determine that a speaker is present. An appropriate value for the threshold TH4 may be determined by experiment.

When the difference between the maximum and minimum evaluation values Vi is equal to or greater than the threshold TH4 (YES in step ST202), the direction selection unit 17 determines that a speaker is present and selects the target direction Di with the largest evaluation value Vi as the output direction Dout (step ST203). This output direction Dout corresponds to the direction of the speaker determined to be present.

When the difference between the maximum and minimum evaluation values Vi is less than the threshold TH4 (NO in step ST202), the direction selection unit 17 determines whether that difference is equal to or greater than a preset threshold TH5 (step ST204). The threshold TH5 is set smaller than the threshold TH4 (TH4 > TH5).

This determination corresponds to processing for determining that no speaker is present. When no speaker is present, the differences among the evaluation values Vi of the target directions Di become small. Setting the threshold TH5 appropriately therefore makes it possible to determine that no speaker is present. An appropriate value for the threshold TH5 may be determined by experiment.

When the difference between the maximum and minimum evaluation values Vi is less than the threshold TH5 (NO in step ST204), the direction selection unit 17 determines that no speaker is present and selects the target direction Di with the smallest evaluation value Vi as the output direction Dout (step ST205). This output direction Dout corresponds to the target direction Di from which the least noise arrives.

This is because, when no speaker is present, all of the sound arriving from the target directions Di is regarded as noise, and the magnitude of the evaluation value Vi is considered to correspond to how sensitively that noise is received.

On the other hand, when the difference between the maximum and minimum evaluation values Vi is equal to or greater than the threshold TH5 (YES in step ST204), the direction selection unit 17 determines that the presence or absence of a speaker is unknown. The direction selection unit 17 then selects the target direction Di that was selected as the output direction Dout in the previous frame (p - 1) as the output direction Dout of the current frame (p) (step ST206).

After selecting the output direction Dout, the direction selection unit 17 notifies the output unit 18 of the selected direction. The output unit 18 then reads the candidate signal of the output direction Dout from the candidate signal storage unit 14 and outputs it as the output signal (step ST207).
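Steps ST201 to ST206 can be sketched as follows. This is a minimal illustration that assumes the evaluation value is the linear sum of equation (9) and that a larger value indicates a higher probability of a speaker; the function names, the 0-based direction indices, and the returned state labels are hypothetical.

```python
def evaluation_values(features, weights):
    """ST201: V_i = sum over q of w_q * C_iq for every direction i."""
    return [sum(w * c for w, c in zip(weights, row)) for row in features]

def select_direction_relative(values, th4, th5, previous):
    """Return (direction index, state) following steps ST202 to ST206."""
    if th5 >= th4:
        raise ValueError("TH5 must be smaller than TH4")
    spread = max(values) - min(values)
    if spread >= th4:
        # ST202 YES -> ST203: speaker present, take the largest evaluation value
        return values.index(max(values)), "speaker"
    if spread < th5:
        # ST204 NO -> ST205: no speaker, take the smallest evaluation value
        return values.index(min(values)), "no_speaker"
    # ST204 YES -> ST206: unknown, keep the direction of the previous frame
    return previous, "unknown"

# Three directions, two feature amounts each, weights (1.0, 2.0).
values = evaluation_values([[1.0, 0.0], [5.0, 1.0], [0.5, 0.0]], [1.0, 2.0])
result = select_direction_relative(values, th4=5.0, th5=1.0, previous=0)
```

Because TH5 is smaller than TH4, the band between the two thresholds acts as a hold region in which the previous output direction is retained.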

In the first selection method, steps ST204 and ST206 may be omitted. This reduces the amount of calculation in the selection of the output direction Dout and speeds up the selection processing. When steps ST204 and ST206 are omitted, the presence or absence of a change in the direction of the speaker may be determined by the determination in step ST202 alone; that is, when the result of step ST202 is NO, the processing of step ST205 is executed.

In the example of FIG. 10, the direction selection unit 17 determines the presence or absence of a speaker by relative evaluation of the evaluation values Vi, but it may instead use absolute evaluation. In that case, when at least one evaluation value Vi is equal to or greater than a threshold TH6, the direction selection unit 17 determines that a speaker is present and selects the target direction Di with the largest evaluation value Vi as the output direction Dout. When all of the evaluation values Vi are less than a threshold TH7, the direction selection unit 17 determines that no speaker is present and selects the target direction Di with the smallest evaluation value Vi as the output direction Dout. The thresholds TH6 and TH7 may be set to appropriate values by experiment, and they may be equal to or different from each other.
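The absolute-evaluation variant can be sketched as below. The text does not say what happens when neither condition holds (the largest value lies between TH7 and TH6); the sketch assumes, by analogy with step ST206, that the previous output direction is kept. That fallback and the function name are assumptions.

```python
def select_direction_absolute(values, th6, th7, previous):
    """Select a direction index by comparing evaluation values with fixed thresholds."""
    largest = max(values)
    if largest >= th6:
        # At least one value reaches TH6: speaker present, take the largest value
        return values.index(largest)
    if largest < th7:
        # Every value is below TH7: no speaker, take the smallest value
        return values.index(min(values))
    # Neither condition holds: keep the previous direction (assumed behaviour)
    return previous

present = select_direction_absolute([0.1, 0.9, 0.2], th6=0.8, th7=0.3, previous=0)
absent = select_direction_absolute([0.1, 0.2, 0.05], th6=0.8, th7=0.3, previous=0)
held = select_direction_absolute([0.1, 0.5, 0.2], th6=0.8, th7=0.3, previous=0)
```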

(Second selection method)
Next, the second method of selecting the output direction Dout using the feature amount Ci is described. In the second selection method, the determination of whether a sound source to be detected is present and the selection of the output direction Dout are each performed using a different feature amount Ciq.

FIG. 11 is a flowchart showing an example of the second selection method. The processing of FIG. 11 corresponds to the processing from step ST104 onward in the first embodiment. Accordingly, when the processing of FIG. 11 starts, the feature amount storage unit 16 already stores the feature amount Ci of each target direction Di. The following description takes as an example the case where the acoustic signal processing device 1 detects speech.

The feature amount Ci of each target direction Di is assumed to include a feature amount Ci1 and a feature amount Ci2, where Ci1 is SNRi and Ci2 is a score Gi. The score Gi is a determination result indicating whether a speaker is present in the target direction Di (whether the sound arriving from the target direction Di contains speech). A value of 1 indicates that a speaker is present in the target direction Di, and a value of 0 indicates that no speaker is present. Such a score Gi is obtained, for example, by inputting the target signal Si or the like to a GMM.

First, the direction selection unit 17 reads the feature amount Ci of each target direction Di from the feature amount storage unit 16 and determines, based on the scores Gi (feature amounts Ci2) read, whether a speaker is present (step ST301). The direction selection unit 17 determines that a speaker is present when at least one score Gi has the value 1, and that no speaker is present when no score Gi has the value 1.

When a speaker is present (YES in step ST301), the direction selection unit 17 selects, from among the target directions Di whose score Gi is 1, the target direction Di with the largest SNRi as the output direction Dout (step ST302).

On the other hand, when no speaker is present (NO in step ST301), the direction selection unit 17 selects the target direction Di with the smallest SNRi as the output direction Dout (step ST303).

After selecting the output direction Dout, the direction selection unit 17 notifies the output unit 18 of the selected direction. The output unit 18 then reads the candidate signal of the output direction Dout from the candidate signal storage unit 14 and outputs it as the output signal (step ST304).

FIG. 12 illustrates a specific example of the second selection method. In the example of FIG. 12, eight target directions Di (i = 1 to 8) are set at equal intervals. A sound source SS2 that is not a speaker (for example, a loudspeaker) is present in the target direction D2. Because of the sound arriving from the sound source SS2, the target direction D2 has a score G2 (feature amount C22) of 0 and an SNR2 (feature amount C21) of 20.

A sound source SS5, which is a speaker, is present in the target direction D5. Because of the sound (speech) arriving from the sound source SS5, the target direction D5 has a score G5 (feature amount C52) of 1 and an SNR5 (feature amount C51) of 10. For simplicity, the scores Gi and the SNRi of all other target directions Di are assumed to be 0.

In this case, because the score G5 is 1, the direction selection unit 17 determines that a speaker is present (YES in step ST301). The direction selection unit 17 then selects, as the output direction Dout, the target direction D5, which has the largest SNRi among the target directions Di whose score Gi is 1 (step ST302). The output unit 18 then outputs the candidate signal of the target direction D5 as the output signal (step ST304). In other words, the acoustic signal processing device 1 outputs the sound (speech) from the sound source SS5 (the speaker).
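The second selection method and the FIG. 12 values can be sketched as follows. The sketch uses 0-based indices (D2 is index 1, D5 is index 4) and a hypothetical function name; ties are resolved in favour of the lowest index, which is an assumption the text does not address.

```python
def select_direction_second(snr, score):
    """Return the output direction index following steps ST301 to ST303."""
    speaker_dirs = [i for i, g in enumerate(score) if g == 1]
    if speaker_dirs:
        # ST301 YES -> ST302: largest SNR among directions whose score is 1
        return max(speaker_dirs, key=lambda i: snr[i])
    # ST301 NO -> ST303: smallest SNR over all directions
    return min(range(len(snr)), key=lambda i: snr[i])

# FIG. 12: SNR2 = 20 with G2 = 0 (loudspeaker), SNR5 = 10 with G5 = 1 (speaker).
snr = [0, 20, 0, 0, 10, 0, 0, 0]
score = [0, 0, 0, 0, 1, 0, 0, 0]
chosen = select_direction_second(snr, score)   # index 4, i.e. D5
```

Restricting the SNR comparison to directions whose score is 1 is what keeps the louder non-speech source in D2 from being selected.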

As described above, the acoustic signal processing device 1 according to the present embodiment uses a plurality of feature amounts Ciq to detect a desired type of sound (sound source) and selects the output direction Dout from among the target directions Di in which that sound (sound source) was detected. As a result, even when noise (sound of a type different from the desired type) is arriving, the target direction Di in which a source of the desired type of sound is present can be selected as the output direction Dout.

In the example of FIG. 12, for instance, SNR2 of the noise from the sound source SS2 is larger than SNR5 of the speech from the sound source SS5. Nevertheless, as described above, the acoustic signal processing device 1 selects the target direction D5 as the output direction Dout and outputs an output signal corresponding to the speech from the sound source SS5.

When the acoustic signal processing device 1 according to the present embodiment is applied to a voice conference system or the like, it can suppress the influence of noise, accurately collect the speech arriving from a speaker, and transmit that speech to the communication partner.

The present invention is not limited to the configurations shown here, including combinations of the configurations described in the above embodiments with other elements. These points can be changed without departing from the spirit of the present invention and can be determined appropriately according to the mode of application.

1: Acoustic signal processing device
11: Sound collection unit
12: Acoustic signal storage unit
13: Beamformer unit
14: Candidate signal storage unit
15: Feature amount calculation unit
16: Feature amount storage unit
17: Direction selection unit
18: Output unit
19: Candidate signal generation unit
100: Microphone array
200: Computer

Japanese Patent No. 5377518

Claims

An acoustic signal processing device comprising:
a beamformer unit that generates, for each of a plurality of target directions, a target signal that is an acoustic signal corresponding to sound arriving from the target direction, based on acoustic signals of a plurality of channels output by a plurality of microphones;
a feature amount calculation unit that calculates a feature amount of each target direction based on the target signal of each target direction; and
a direction selection unit that selects an output direction from among the plurality of target directions based on the feature amount of each target direction.

The acoustic signal processing device according to claim 1, wherein the direction selection unit determines the presence or absence of a sound source in each target direction based on the feature amount, selects the target direction from which the sound arriving from the sound source is largest when it determines that the sound source is present, and selects the target direction from which the arriving sound is smallest when it determines that the sound source is not present.

The feature amount calculation unit generates a noise signal that is an acoustic signal corresponding to a sound arriving from a direction different from the target direction for each target direction,
The acoustic signal processing apparatus according to claim 1, wherein the feature amount calculation unit calculates the feature amount based on the target signal and the noise signal.

The acoustic signal processing apparatus according to claim 3, further comprising a candidate signal generation unit that removes the noise signal from the target signal.

The acoustic signal processing device according to claim 1 or 2, wherein the feature amount calculation unit calculates the feature amount based on the target signal of a first target direction and the target signal of a second target direction different from the first target direction.

The acoustic signal processing apparatus according to claim 1, wherein the feature amount includes a signal-to-noise ratio.

The acoustic signal processing device according to item 1, wherein the feature amount calculation unit calculates the feature amount based on frequency components of the target signal at or below a first frequency corresponding to the installation interval of the plurality of microphones.

The acoustic signal processing device according to claim 1, wherein the feature amount calculation unit calculates a plurality of the feature amounts for each of the target directions.

The acoustic signal processing device according to any one of claims 1 to 8, wherein the direction selection unit averages the current feature amount when the amount of change of the current feature amount relative to the previous feature amount is less than a predetermined threshold, and selects the output direction based on the averaged feature amount.

An acoustic signal processing method comprising:
generating, for each of a plurality of target directions, a target signal that is an acoustic signal corresponding to sound arriving from the target direction, based on acoustic signals of a plurality of channels output by a plurality of microphones;
calculating a feature amount of each target direction based on the target signal of each target direction; and
selecting an output direction from among the plurality of target directions based on the feature amount of each target direction.

A program that causes a computer to execute:
generating, for each of a plurality of target directions, a target signal that is an acoustic signal corresponding to sound arriving from the target direction, based on acoustic signals of a plurality of channels output by a plurality of microphones;
calculating a feature amount of each target direction based on the target signal of each target direction; and
selecting an output direction from among the plurality of target directions based on the feature amount of each target direction.