JP2020018015A - Acoustic signal processing device, method and program

Info

Publication number: JP2020018015A
Application number: JP2019197593A
Authority: JP
Inventors: Kazunori Kobayashi; Hiroaki Ito; Shoichiro Saito; Noboru Harada; Takuya Higuchi; Nobutaka Ito; Akiko Araki; Keisuke Kinoshita; Tomohiro Nakatani
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-07-31
Filing date: 2019-10-30
Publication date: 2020-01-30
Anticipated expiration: 2037-07-31
Also published as: JP6969597B2

Abstract

To provide an acoustic signal processing device capable of accurately performing directional sound collection. SOLUTION: The acoustic signal processing device comprises: a direction estimation unit 22 configured to estimate the arrival direction of a sound from signals collected by multiple microphones such that a direction closer to the arrival direction estimated when a predetermined specific sound was detected is more likely to be estimated as the arrival direction; and a first directional sound collection unit 23 configured to collect sound such that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized. SELECTED DRAWING: Figure 6

Description

The present invention relates to an acoustic signal processing technique.

The directional sound collection techniques described in Patent Literatures 1 and 2 are known (see, for example, Patent Literatures 1 and 2).

FIG. 12 shows the configuration of a conventional directional sound collection device as disclosed in Patent Literatures 1 and 2 and elsewhere. The directional sound collection device of FIG. 12 includes a direction estimation unit 41 and a directional sound collection unit 42.

The direction estimation unit 41 estimates the direction of a sound source based on signals collected by a plurality of microphones. The estimation uses the time differences and amplitude differences that arise between the microphones as cues.
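As an illustration of estimation from inter-microphone time differences, the following is a minimal sketch for two microphones, assuming a far-field source and a brute-force cross-correlation search. The function names, the lag sign convention, and the default speed of sound are illustrative assumptions and are not taken from Patent Literatures 1 and 2.

```python
import math


def estimate_delay_samples(x0, x1, max_lag):
    """Return the lag (in samples) by which x1 trails x0, found by
    maximizing the cross-correlation over lags in [-max_lag, max_lag].
    Both signals are assumed to have the same length."""
    n = len(x0)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate x0[t] with x1[t + lag] over the overlapping samples.
        score = sum(x0[t] * x1[t + lag]
                    for t in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delay_to_angle_deg(lag, fs, mic_distance, c=343.0):
    """Convert an inter-microphone delay to a far-field arrival angle.
    0 degrees is broadside; the value is clipped so asin stays defined."""
    s = max(-1.0, min(1.0, (lag / fs) * c / mic_distance))
    return math.degrees(math.asin(s))
```

A positive lag here means the sound reached microphone 0 first. Amplitude differences, which the text also mentions as a cue, are not used in this sketch.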

Next, the directional sound collection unit 42 performs directional sound collection so that sound from the estimated direction is emphasized. It can emphasize sound from the estimated direction by setting delay times and filter coefficients so that sound from the targeted direction is emphasized. With this technique, if there is only one sound source, the direction of that source can be estimated and sound can be collected with that direction emphasized.
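The idea of emphasizing a direction by setting delay times can be sketched with a delay-and-sum beamformer. This is a minimal sketch assuming integer sample delays that have already been derived from the estimated direction; the function name and argument layout are hypothetical, and the cited literature may use filters rather than pure delays.

```python
def delay_and_sum(channels, delays):
    """Align each microphone channel by its steering delay and average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   per-channel arrival delay in samples for the target direction;
              channel m is advanced by delays[m] so the target adds coherently.
    """
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for x, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:  # samples outside the buffer count as zero
                acc += x[idx]
        out.append(acc / len(channels))
    return out
```

When the delays match the true arrival delays, the target components add in phase, while sound from other directions is averaged with mismatched alignment and is attenuated.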

Patent Literature 1: JP 2001-309483 A
Patent Literature 2: JP 2005-64968 A

However, when both a sound source to be collected and a noise source are present, a conventional directional sound collection device cannot tell which one is the source to be collected, and may erroneously emphasize the noise source. For example, consider a robot or a remote control that is used in a living room and relies on speech recognition for dialogue or device operation: it may also react to a sound source such as a TV and malfunction.

An object of the present invention is to provide an acoustic signal processing device, method, and program that perform more accurate directional sound collection.

An acoustic signal processing device according to one aspect of the present invention includes: a direction estimation unit that estimates the arrival direction of a sound from signals collected by a plurality of microphones such that a direction closer to the arrival direction estimated when a specific sound, which is a predetermined sound, was detected is more likely to be estimated as the arrival direction of the specific sound; and a first directional sound collection unit that collects sound such that sound from the arrival direction estimated by the direction estimation unit is emphasized. When the direction estimation unit estimates a plurality of arrival directions at the time the specific sound is detected, it uses the arrival directions estimated in a predetermined past interval to select the arrival direction of the specific sound from among the plurality of arrival directions.

Performing acoustic signal processing based on information obtained from a specific sound that is known in advance enables more accurate directional sound collection.

FIG. 1 is a block diagram illustrating an example of the acoustic signal processing device of the first embodiment.
FIG. 2 is a block diagram illustrating an example of the acoustic signal processing device of Modification 1 of the first embodiment.
FIG. 3 is a block diagram illustrating an example of the acoustic signal processing device of Modification 2 of the first embodiment.
FIG. 4 is a block diagram illustrating an example of the acoustic signal processing device of Modification 3 of the first embodiment.
FIG. 5 is a flowchart illustrating an example of the acoustic signal processing method.
FIG. 6 is a block diagram illustrating an example of the acoustic signal processing device of the second embodiment.
FIG. 7 is a block diagram illustrating an example of the direction estimation unit 22 of the second embodiment.
FIG. 8 is a block diagram illustrating an example of the direction estimation unit 22 of the second embodiment.
FIG. 9 is a block diagram illustrating an example of the acoustic signal processing device of Modification 2 of the second embodiment.
FIG. 10 is a block diagram illustrating an example of the acoustic signal processing device of Modification 3 of the second embodiment.
FIG. 11 is a flowchart illustrating an example of the acoustic signal processing method.
FIG. 12 is a block diagram illustrating an example of the directional sound collection device of the background art.
FIG. 13 is a functional block diagram of the acoustic signal processing device according to the third embodiment.
FIG. 14 is a diagram showing an example of the processing flow of the acoustic signal processing device according to the third embodiment.
FIG. 15 is a functional block diagram of the voice section detection information storage unit according to the third embodiment.
FIG. 16 is a diagram for explaining specific-sound voice sections and non-voice sections.
FIG. 17 is a functional block diagram of the voice section detection unit according to the third embodiment.
FIG. 18 is a functional block diagram of the first acoustic signal analysis unit according to the third embodiment.
FIG. 19 is a functional block diagram of the probability estimation unit according to the third embodiment.
FIG. 20 is a functional block diagram of the voice section detection unit according to the first modification of the third embodiment.
FIG. 21 is a functional block diagram of the acoustic signal processing device according to the third and fourth modifications of the third embodiment.
FIG. 22 is a diagram showing an example of the processing flow of the acoustic signal processing device according to the third and fourth modifications of the third embodiment.
FIG. 23 is a block diagram illustrating an example of the acoustic signal processing device of Modification 4 of the first embodiment.
FIG. 24 is a flowchart illustrating an example of the acoustic signal processing method of Modification 4 of the first embodiment.

In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the character that follows them, but because of the limitations of text notation they are written immediately before that character. In the formulas, these symbols are written in their proper positions. Unless otherwise noted, processing described for an individual element of a vector or matrix is applied to all elements of that vector or matrix.

[Technical background]
The acoustic signal processing device is given information about a specific sound, which is a predetermined sound, and performs acoustic signal processing using that information. Using information about the specific sound that is given in advance increases the amount of usable information, so more accurate acoustic signal processing can be performed.

Examples of acoustic signal processing are estimation of the arrival direction of a sound, directional sound collection, extraction of a target speech, detection of voice sections, and speech recognition.

For example, by detecting a keyword, which is a specific sound, in a specific utterance of the user, the signal section of the target speech and the signal section of the noise can be identified accurately and used in subsequent processing.

When this property is used for voice section detection, the signals of the noise section and of the voice section are each known, so the parameters for the voice/non-voice decision can be updated to values that better match the actual measurements.

When the acoustic signal processing is estimation of the speech direction, the direction in which the specific sound was detected is regarded as the direction of the speech, so direction estimation operates robustly even when sound containing speech arrives from a direction other than the intended one.

When the acoustic signal processing is target speech extraction, the signals of the voice section and of the non-voice section are obtained accurately, so the spatial correlation matrix used to calculate the steering vector for speech separation can be obtained more accurately.

When the acoustic signal processing is speech recognition, the noise level can be obtained more accurately, so accuracy can be improved through the choice of acoustic model.

Each embodiment is described below with reference to the drawings.

[First embodiment]
The acoustic signal processing device and method of the first embodiment perform directional sound collection as the acoustic signal processing.

As shown in FIG. 1, the acoustic signal processing device includes, for example, a direction estimation unit 11, a specific sound detection unit 12, a direction storage unit 13, and a first directional sound collection unit 14. The acoustic signal processing device need not include the specific sound detection unit 12.

The acoustic signal processing method is realized, for example, by the acoustic signal processing device performing the processing of steps S11 to S14 shown in FIG. 5 and described below.

The direction estimation unit 11 estimates the arrival direction of a sound from signals collected by a plurality of microphones (step S11). The direction estimation unit 11 estimates the arrival direction of the sound at each time. The estimated arrival direction at each time is output to the direction storage unit 13.

Any direction estimation method may be used by the direction estimation unit 11. For example, the direction estimation unit 11 estimates the arrival direction of the sound using the direction estimation techniques described in Patent Literatures 1 and 2. The arrival direction of the sound may be expressed as a position rather than a direction.

The specific sound detection unit 12 detects a specific sound, which is a predetermined sound (step S12). Examples of the predetermined sound are the speech of a specific keyword, a whistle, and hand clapping. A predetermined sound other than these examples may also be used.

The direction storage unit 13 stores the arrival direction estimated by the direction estimation unit 11 at the time when the specific sound detection unit 12 detected the specific sound. More specifically, of the arrival directions at each time that are input from the direction estimation unit 11, the direction storage unit 13 stores the arrival direction at the time when the specific sound detection unit 12 detected the specific sound.
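The latching behavior of the direction storage unit can be sketched as follows. This is a minimal sketch assuming frame-by-frame processing in which each frame supplies one estimated direction and a detection flag; the class and method names are hypothetical.

```python
class DirectionStore:
    """Keeps the arrival direction estimated at the most recent frame
    in which the specific sound was detected."""

    def __init__(self):
        self.direction = None  # nothing stored until the first detection

    def update(self, estimated_direction, specific_sound_detected):
        # Overwrite the stored direction only at detection times; all other
        # frames leave it unchanged, so later noise cannot move the beam.
        if specific_sound_detected:
            self.direction = estimated_direction
        return self.direction
```

The returned value is what a directional sound collection stage would read as its steering direction.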

The first directional sound collection unit 14 collects sound so that sound from the arrival direction read from the direction storage unit 13 is emphasized (step S14). Any directional sound collection method may be used by the first directional sound collection unit 14. For example, the first directional sound collection unit 14 performs the directional sound collection described in JP 2009-44588 A.

In this way, the source that emitted the specific sound is identified as the source to be collected, and directional sound collection is aimed at that source, so sound can be collected with a high signal-to-noise ratio. By emitting a specific sound such as a specific keyword, the user can change the direction of the directivity. Even when a sound source such as a television is present, the user can point the directivity at themselves and then keep it fixed.

If the specific sound detection unit 12 takes time to detect the specific sound, a delay unit 15 that delays by a corresponding amount of time may be placed after the direction estimation unit 11. In FIG. 1, the delay unit 15 is indicated by a broken line. The delay unit 15 delays the output of the direction estimation unit 11 by a time corresponding to the detection time of the specific sound detection unit 12 and then inputs it to the direction storage unit 13. As a result, the device operates correctly even when detection of the specific sound is delayed.
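The role of the delay unit can be sketched as a fixed-length FIFO that holds back direction estimates until the detector's decision for the same frame is available. This is a minimal sketch assuming the detection latency is a known whole number of frames; the class name and the `fill` parameter are hypothetical.

```python
from collections import deque


class DelayUnit:
    """Delays a stream of values by a fixed number of frames."""

    def __init__(self, delay_frames, fill=None):
        # Pre-fill so the first delay_frames outputs are the fill value.
        self._buf = deque([fill] * delay_frames)

    def push(self, value):
        # Append the newest value and emit the one from delay_frames ago.
        self._buf.append(value)
        return self._buf.popleft()
```

With a delay equal to the detector latency, the direction that reaches the storage stage at the moment of detection is the one that was estimated while the specific sound was actually being emitted.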

[[Modification 1 of the first embodiment]]
As illustrated in FIG. 2, the acoustic signal processing device may further include an estimation frequency measurement unit 16 and a selection unit 17. In this case, the direction estimation unit 11 may be capable of estimating a plurality of directions simultaneously. That is, when a noise source is sounding at the same time as the specific sound, the direction estimation unit 11 may be able to estimate the directions of both sources. In that situation it is no longer possible to tell which source emitted the specific sound, so the estimation frequency measurement unit 16 makes the decision based on how often each direction was estimated in the past. A source such as a TV outputs sound constantly, so its direction can be expected to have been estimated many times in the past, and the estimation frequency measurement unit 16 uses this as the clue.

The estimation frequency measurement unit 16 measures the frequency of each arrival direction estimated by the direction estimation unit 11 in a predetermined past time interval (step S16). That is, it measures how often each direction was estimated within a fixed period in the past. Information about the measured frequencies is output to the selection unit 17.

For example, if the output of the direction estimation unit 11 was direction θ for A(θ) seconds during the past T seconds, the estimation frequency of direction θ is given by the ratio D(θ) = A(θ)/T. The estimation frequency measurement unit 16 computes this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a source is in direction θ, the estimation frequency D(θ) takes a large value close to 1.

The selection unit 17 selects the arrival direction with the lowest frequency among the frequencies measured by the estimation frequency measurement unit 16. For example, when the direction estimation unit 11 outputs two estimated directions, the selection unit 17 selects the one with the smaller estimation frequency D(θ). The arrival direction selected by the selection unit 17 at the time when the specific sound detection unit 12 detected the specific sound is stored in the direction storage unit 13.
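The frequency measurement and selection steps can be sketched as below. This is a minimal sketch assuming directions are quantized to discrete values and that the past T seconds are represented as a list of frames, each holding the directions estimated in that frame, so that A(θ)/T becomes a ratio of frame counts. The function names are hypothetical.

```python
from collections import Counter


def estimation_frequency(history):
    """Compute D(theta) = A(theta) / T over a window of past frames.

    history: list of frames; each frame is an iterable of the directions
             estimated in that frame (possibly several, possibly none).
    """
    total = len(history)
    if total == 0:
        return {}
    # Count each direction at most once per frame.
    counts = Counter(theta for frame in history for theta in set(frame))
    return {theta: n / total for theta, n in counts.items()}


def select_direction(candidates, frequency):
    """Pick the candidate direction that was estimated least often.
    Directions never seen in the window have frequency 0."""
    return min(candidates, key=lambda theta: frequency.get(theta, 0.0))
```

For instance, with a TV at 90 degrees that is active in every frame and a user at 30 degrees who speaks only in the final frame, the TV direction gets a frequency near 1 and the user direction is selected.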

After that, in the same manner as described above, the first directional sound collection unit 14 collects sound so that sound from the arrival direction read from the direction storage unit 13 is emphasized.

In Modification 1 of the first embodiment as well, if the specific sound detection unit 12 takes time to detect the specific sound, a delay unit 15 that delays by a corresponding amount of time may be placed after the direction estimation unit 11. In FIG. 2, the delay unit 15 is indicated by a broken line. As a result, the device operates correctly even when detection of the specific sound is delayed.

[[Modification 2 of the first embodiment]]
As illustrated in FIG. 3, the acoustic signal processing device may further include a second directional sound collection unit 18.

Performing directional sound collection with the second directional sound collection unit 18 before the processing of the specific sound detection unit 12 allows the specific sound to be detected more accurately.

The second directional sound collection unit 18 receives delayed versions of the signals collected by the plurality of microphones. The length of this delay corresponds to the time the direction estimation unit 11 needs to estimate the arrival direction. The delay is applied by the delay unit 19, shown by a broken line in FIG. 3. The arrival direction estimated by the direction estimation unit 11 is also input to the second directional sound collection unit 18.

The second directional sound collection unit 18 collects sound so that sound from the arrival direction estimated by the direction estimation unit 11 is emphasized (step S18). More specifically, it uses the delayed versions of the signals collected by the plurality of microphones and collects sound so that sound from the arrival direction estimated by the direction estimation unit 11 is emphasized. The signal collected by the second directional sound collection unit 18 is output to the specific sound detection unit 12.

The specific sound detection unit 12 detects the specific sound based on the signal collected by the second directional sound collection unit 18. The subsequent processing is the same as described above.

As shown in FIG. 3, the acoustic signal processing device may include a plurality of second directional sound collection units 18. In this case, the device includes the same number of specific sound detection units 12 as second directional sound collection units 18.

In this case, when the direction estimation unit 11 estimates a plurality of arrival directions, the second directional sound collection units 18 operate so as to emphasize each of the estimated arrival directions, their outputs are input to the respective specific sound detection units 12, and the specific sound is detected on each output.

This makes it possible to assign priorities when the specific sound is detected by more than one of the specific sound detection units 12.

In Modification 2 of the first embodiment as well, if the specific sound detection unit 12 takes time to detect the specific sound, a delay unit 15 that delays by a corresponding amount of time may be placed after the direction estimation unit 11. In FIG. 3, the delay unit 15 is indicated by a broken line. As a result, the device operates correctly even when detection of the specific sound is delayed.

[[Modification 3 of the first embodiment]]
As illustrated in FIG. 4, the acoustic signal processing device of Modification 2 of the first embodiment may further include the estimation frequency measurement unit 16 and the selection unit 17 described in Modification 1 of the first embodiment. In this case, the direction estimation unit 11 may be capable of estimating a plurality of directions simultaneously. That is, when a noise source is sounding at the same time as the specific sound, the direction estimation unit 11 may be able to estimate the directions of both sources.

The processing of the estimation frequency measurement unit 16 and the selection unit 17 is the same as that described in Modification 1 of the first embodiment.

That is, the estimation frequency measurement unit 16 measures the frequency of each arrival direction estimated by the direction estimation unit 11 in a predetermined past time interval (step S16); in other words, it measures how often each direction was estimated within a fixed period in the past. Information about the measured frequencies is output to the selection unit 17.

In Modification 3 of the first embodiment as well, if the specific sound detection unit 12 takes time to detect the specific sound, a delay unit 15 that delays by a corresponding amount of time may be placed after the direction estimation unit 11. In FIG. 4, the delay unit 15 is indicated by a broken line. As a result, the device operates correctly even when detection of the specific sound is delayed.

[[Modification 4 of the first embodiment]]
As illustrated in FIG. 23, the acoustic signal processing device may include a third directional sound collection unit 52 in place of the first directional sound collection unit 14, and may further include a noise direction storage unit 51.

The acoustic signal processing method is realized, for example, by the acoustic signal processing device performing the processing of step S52 shown in FIG. 24 and described below.

The noise direction storage unit 51 stores the arrival directions estimated by the direction estimation unit 11 at times other than the time when the specific sound detection unit 12 detected the specific sound. Here, times other than the time when the specific sound was detected may be times before the detection, times after the detection, or both. Needless to say, the delay unit 15 may be placed before the noise direction storage unit 51 and after the direction estimation unit 11.

The third directional sound collection unit 52 collects sound so that sound from the arrival direction read from the direction storage unit 13 is emphasized and sound from the arrival direction read from the noise direction storage unit 51 is suppressed (step S52). Any directional sound collection method may be used by the third directional sound collection unit 52. For example, the method described in Reference 5 may be used.
(Reference 5) Futoshi Asano, "音のアレイ信号処理" (Array Signal Processing for Sound), pp. 82-85, Corona Publishing, 2011.
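To make "emphasize one direction and suppress another" concrete, the following is a minimal sketch of textbook narrowband null steering for two microphones. It is not claimed to be the method of Reference 5; the far-field model, the single frequency, and all function and parameter names are assumptions made for illustration.

```python
import cmath
import math


def phase(theta_deg, freq, mic_distance, c=343.0):
    """Inter-microphone phase lag (radians) of a far-field plane wave."""
    return (2.0 * math.pi * freq * mic_distance
            * math.sin(math.radians(theta_deg)) / c)


def null_steering_weights(target_deg, noise_deg, freq, mic_distance):
    """Solve u0 + u1*a(target) = 1 and u0 + u1*a(noise) = 0,
    where a(theta) = exp(-j * phase(theta)) is the second mic's response."""
    a_t = cmath.exp(-1j * phase(target_deg, freq, mic_distance))
    a_n = cmath.exp(-1j * phase(noise_deg, freq, mic_distance))
    if abs(a_t - a_n) < 1e-9:
        raise ValueError("target and noise directions are indistinguishable")
    u1 = 1.0 / (a_t - a_n)
    u0 = -u1 * a_n
    return u0, u1


def response(weights, theta_deg, freq, mic_distance):
    """Complex gain of the weighted two-microphone sum toward theta."""
    u0, u1 = weights
    return u0 + u1 * cmath.exp(-1j * phase(theta_deg, freq, mic_distance))
```

With the target direction taken from the direction storage and the noise direction taken from the noise direction storage, the resulting weights pass the target with unit gain and place a null on the noise at the chosen frequency.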

[Second embodiment]
The acoustic signal processing device and method of the second embodiment perform directional sound collection as the acoustic signal processing.

As shown in FIG. 6, the acoustic signal processing device includes, for example, a specific sound detection unit 21, a direction estimation unit 22, and a first directional sound collection unit 23. The acoustic signal processing device need not include the specific sound detection unit 21.

The acoustic signal processing method is realized, for example, by the acoustic signal processing device performing the processing of steps S21 to S23 shown in FIG. 11 and described below.

The specific sound detection unit 21 detects a specific sound, which is a predetermined sound (step S21). Examples of the predetermined sound are the speech of a specific keyword, a whistle, and hand clapping. A predetermined sound other than these examples may also be used.

The direction estimation unit 22 estimates the arrival direction of a sound from signals collected by a plurality of microphones (step S22). In doing so, it estimates the arrival direction such that a direction closer to the arrival direction estimated at the time when the specific sound detection unit 21 detected the specific sound is more likely to be estimated as the arrival direction.

That is, in the direction estimation unit 22, the ease with which each direction is detected is set according to the result of the specific sound detection. In other words, in the direction estimation unit 22, a direction is more likely to be detected the closer it is to the direction estimated when the specific sound was detected, and less likely to be detected the farther it is from that direction. As a result, the directivity is more readily pointed toward the user who emitted the specific sound and less readily pointed toward a noise source. In addition, even if the user who emitted the specific sound moves, the directivity can follow that movement.

FIG. 7 shows an example of the configuration of the direction estimation unit 22. As illustrated in FIG. 7, the direction estimation unit 22 includes a direction emphasis unit 221, a power calculation unit 222, a weight multiplication unit 223, a maximum power direction detection unit 224, and a weight determination unit 225.

Each of the signals collected by the plurality of microphones is input to the direction emphasis unit 221.

The direction emphasis unit 221 performs direction emphasis processing on the signals collected by the plurality of microphones so as to emphasize each of a plurality of directions (step S221). For example, when N direction emphasis units 221 are provided, with θ1, θ2, ..., θN being mutually different directions, the N direction emphasis units 221 perform direction emphasis processing so as to emphasize the directions θ1, θ2, ..., θN, respectively. The emphasized signals are output to the power calculation unit 222.

The power calculation unit 222 calculates the power of each signal emphasized by the direction emphasis unit 221 (step S222). The calculated power is output to the weight multiplication unit 223.

The weight multiplication unit 223 multiplies the power calculated by the power calculation unit 222 by the weight set by the weight setting unit 225 (step S223). The weighted power is output to the maximum power direction detection unit 224. Accordingly, as described later, the weight multiplication unit 223 obtains the weighted power by multiplying the power of the signal in which each arrival direction is emphasized by a weight that is larger the closer that arrival direction is to the selected arrival direction.

The maximum power direction detection unit 224 selects, from the outputs of the weight multiplication unit 223, the arrival direction with the maximum power. In other words, the maximum power direction detection unit 224 selects the arrival direction with the largest weighted power and takes the selected arrival direction as the estimated arrival direction (step S224). The estimated arrival direction is output, as the direction estimation result, to the weight determination unit 225 and the first directional sound collection unit 23.
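Steps S223 and S224 can be illustrated with a minimal Python sketch. It assumes the per-direction powers and weights are already available as lists of equal length; the function name `estimate_direction` and the example values are hypothetical and are not taken from the specification.

```python
def estimate_direction(powers, weights):
    """Return the index of the direction with the largest weighted power.

    powers[i]  : power of the signal emphasized toward direction i (step S222)
    weights[i] : weight for direction i (assumed to be given)
    """
    if len(powers) != len(weights):
        raise ValueError("powers and weights must have the same length")
    # Step S223: multiply each power by its weight.
    weighted = [p * w for p, w in zip(powers, weights)]
    # Step S224: select the direction whose weighted power is largest.
    return max(range(len(weighted)), key=weighted.__getitem__)


# Hypothetical example with four candidate directions (degrees).
directions = [0, 90, 180, 270]
powers = [1.0, 4.0, 5.0, 0.5]      # 180 degrees is loudest before weighting
weights = [0.5, 1.0, 0.5, 0.25]    # weighting favors 90 degrees
best = directions[estimate_direction(powers, weights)]
```

With uniform weights the loudest direction would win; the weighting is what lets a slightly quieter direction near the user be selected instead.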

The weight setting unit 225 determines the weights corresponding to the direction estimation result output by the maximum power direction detection unit 224 at the time when the specific sound is detected by the specific sound detection unit 21. The determined weights are output to the weight multiplication unit 223. In other words, the weight setting unit 225 sets the weights corresponding to the direction estimation result when the specific sound is detected.

The weights corresponding to the direction estimation result are set such that the weight for the estimated arrival direction is large and the weight decreases with increasing distance from that arrival direction. For example, the weight for the estimated arrival direction is set to 1.0, and the weight is multiplied by a factor of less than 1.0 (for example, 0.8) for every 10 degrees of deviation from the estimated arrival direction.
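The weight rule in this example can be sketched in Python as follows. This is only one possible reading: it assumes the factor is applied once per full 10-degree step and that angles wrap around at 360 degrees (as they would for a circular array). The function names and the default values for `step_deg` and `factor` are illustrative assumptions.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def set_weights(directions_deg, estimated_deg, step_deg=10.0, factor=0.8):
    """Weight 1.0 at the estimated direction, multiplied by `factor`
    for every full `step_deg` degrees of deviation from it."""
    if not 0.0 < factor < 1.0:
        raise ValueError("factor must be between 0 and 1")
    return [
        factor ** int(angular_distance(d, estimated_deg) // step_deg)
        for d in directions_deg
    ]


# Hypothetical candidate directions around an estimated direction of 0 degrees.
weights = set_weights([0, 10, 20, 350], estimated_deg=0)
# The weights are approximately 1.0, 0.8, 0.64, and 0.8.
```

A continuous decay (for example, raising the factor to a fractional power of the deviation) would satisfy the same description; the stepwise form is chosen here only because it mirrors the "every 10 degrees" wording.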

The first directional sound collection unit 23 performs sound collection such that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized (step S23). The directional sound collection method used by the first directional sound collection unit 23 is arbitrary. The first directional sound collection unit 23 performs, for example, the directional sound collection described in JP 2009-44588 A.

If the specific sound detection unit 21 takes time to detect the specific sound, a delay unit 226 that applies a delay corresponding to that time may be inserted after the maximum power direction detection unit 224. In FIG. 7, the delay unit 226 is indicated by a broken line. The delay unit 226 delays the output of the maximum power direction detection unit 224 by a time corresponding to the time taken by the specific sound detection unit 21 to detect the specific sound, and then inputs it to the weight setting unit 225. As a result, the device operates correctly even when detection of the specific sound is delayed.
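The role of the delay unit 226 can be illustrated with a simple fixed-length delay line. The sketch below assumes the detection latency is known as a whole number of frames; the class name `DelayUnit` and the parameter `delay_frames` are hypothetical.

```python
from collections import deque


class DelayUnit:
    """Delay a stream of per-frame values by a fixed number of frames."""

    def __init__(self, delay_frames, initial=None):
        if delay_frames < 0:
            raise ValueError("delay_frames must be non-negative")
        # Pre-fill the buffer so the first outputs are the initial value.
        self._buffer = deque([initial] * delay_frames)

    def push(self, value):
        """Store the newest value and return the one from delay_frames ago."""
        self._buffer.append(value)
        return self._buffer.popleft()


# Hypothetical stream of direction estimates (degrees), delayed by two frames.
delay = DelayUnit(delay_frames=2)
delayed = [delay.push(direction) for direction in [30, 30, 40, 50]]
```

When the detector reports the specific sound, the value leaving the delay line is the direction that was estimated while the sound was actually being emitted, rather than the direction estimated at the later moment of detection.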

[[Modification 1 of Second Embodiment]]
As illustrated in FIG. 8, the acoustic signal processing device may further include an estimated frequency measurement unit 227 and a selection unit 228.

In this case, the maximum power direction detection unit 224 may be capable of estimating a plurality of directions simultaneously by detecting all directions whose power exceeds a predetermined threshold. That is, the maximum power direction detection unit 224 detects the direction of maximum power and then, excluding the directions already detected, detects the direction of maximum power again. The maximum power direction detection unit 224 ends the maximum power detection when a preset maximum number of estimated directions is reached or when the maximum power falls to or below a preset threshold. The maximum power direction detection unit 224 may be capable of simultaneously estimating the directions of a plurality of sound sources by such a method, for example. As a result, when sound from a noise source is present at the same time as the specific sound, the maximum power direction detection unit 224 can estimate the directions of both sound sources.
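The repeated maximum search described here might be sketched as follows. The function name `detect_directions` and its parameter names are assumptions for illustration, and the sketch excludes only the detected direction itself; a practical implementation would probably also exclude neighboring directions belonging to the same peak.

```python
def detect_directions(weighted_powers, max_directions, threshold):
    """Return indices of detected directions, strongest first."""
    remaining = dict(enumerate(weighted_powers))
    detected = []
    while remaining and len(detected) < max_directions:
        best = max(remaining, key=remaining.get)
        if remaining[best] <= threshold:
            break  # the largest remaining power is at or below the threshold
        detected.append(best)
        del remaining[best]  # exclude the detected direction from later passes
    return detected


# Hypothetical weighted powers for four directions.
found = detect_directions([0.2, 3.0, 0.1, 2.5], max_directions=3, threshold=1.0)
```

Here the search stops after two directions because the next largest power does not exceed the threshold, even though up to three directions were allowed.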

In this case, it is no longer possible to tell which sound source emitted the specific sound, so the estimated frequency measurement unit 227 makes this determination based on how often each direction has been estimated in the past. That is, a sound source such as a TV outputs sound constantly, so its direction is considered to have been estimated many times in the past; the estimated frequency measurement unit 227 uses this as a clue for the determination.

The estimated frequency measurement unit 227 measures the frequency of the arrival directions estimated by the direction estimation unit 22, in other words, the frequency of the arrival directions selected by the maximum power direction detection unit 224, in a predetermined past time section (step S16). That is, the estimated frequency measurement unit 227 measures how often each direction has been estimated within a fixed past period. Information on the measured frequency is output to the selection unit 228.

For example, if the time during which the output of the maximum power direction detection unit 224 was direction θ in the past T seconds is A(θ) seconds, the estimated frequency of direction θ is given by their ratio D(θ) = A(θ)/T. The estimated frequency measurement unit 227 obtains this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker used for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a sound source is located in direction θ, the estimated frequency D(θ) takes a large value close to 1.
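The ratio D(θ) = A(θ)/T can be computed from a per-frame history of detected directions. The following sketch assumes a fixed frame duration and a history that lists, for each frame, the directions detected in it; the function and parameter names are hypothetical.

```python
from collections import Counter


def estimated_frequency(history, frame_seconds, window_seconds):
    """Compute D(theta) = A(theta) / T over the most recent T seconds.

    history        : per-frame lists of detected directions, oldest first
    frame_seconds  : duration of one frame
    window_seconds : T, the length of the observation window
    """
    n_frames = int(round(window_seconds / frame_seconds))
    recent = history[-n_frames:] if n_frames > 0 else []
    counts = Counter(d for frame in recent for d in set(frame))
    # A(theta) = (number of frames containing theta) * frame_seconds
    return {d: c * frame_seconds / window_seconds for d, c in counts.items()}


# Hypothetical history: a TV at 90 degrees is detected in every frame,
# while a talker at 200 degrees appears only in the last frame.
history = [[90], [90], [90], [90, 200]]
freq = estimated_frequency(history, frame_seconds=0.5, window_seconds=2.0)
```

In this example the constantly active direction reaches D = 1.0, while the direction that appeared only once has D = 0.25.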

The selection unit 228 selects the arrival direction with the lowest frequency among the frequencies measured by the estimated frequency measurement unit 227. For example, when the output of the maximum power direction detection unit 224 contains two estimated directions, the selection unit 228 selects the one with the smaller estimated frequency D(θ). The selected arrival direction is output to the weight setting unit 225.
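The selection rule amounts to taking the minimum of D(θ) over the candidate directions. A minimal sketch, assuming the frequencies are held in a dictionary keyed by direction (the name `select_direction` and the example values are illustrative):

```python
def select_direction(candidates, frequency):
    """Pick the candidate direction with the lowest estimated frequency D(theta).

    Directions missing from `frequency` are treated as never estimated (0.0).
    """
    if not candidates:
        raise ValueError("no candidate directions")
    return min(candidates, key=lambda d: frequency.get(d, 0.0))


# Hypothetical case: 90 degrees is a TV that is almost always active,
# 200 degrees is a user who has just spoken the keyword.
frequency = {90: 0.95, 200: 0.05}
chosen = select_direction([90, 200], frequency)
```

Treating an unseen direction as frequency 0.0 is a design choice of this sketch: a direction that has never been estimated before is the least likely to be a steady noise source.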

If the specific sound detection unit 21 takes time to detect the specific sound, a delay unit 226 that applies a delay corresponding to that time may be inserted after the maximum power direction detection unit 224. In FIG. 8, the delay unit 226 is indicated by a broken line. The delay unit 226 delays the output of the maximum power direction detection unit 224 by a time corresponding to the time taken by the specific sound detection unit 21 to detect the specific sound, and then inputs it to the weight setting unit 225. As a result, the device operates correctly even when detection of the specific sound is delayed.

[[Modification 2 of Second Embodiment]]
As illustrated in FIG. 9, the acoustic signal processing device may further include a second directional sound collection unit 24.

Performing directional sound collection with the second directional sound collection unit 24 before the processing of the specific sound detection unit 21 allows the specific sound to be detected with higher accuracy.

Signals obtained by delaying the signals collected by the plurality of microphones are input to the second directional sound collection unit 24. The length of this delay corresponds to the time required for the arrival direction estimation processing by the direction estimation unit 22. The delay is applied by the delay unit 25 shown by a broken line in FIG. 9. The arrival direction estimated by the direction estimation unit 22 is also input to the second directional sound collection unit 24.

The second directional sound collection unit 24 performs sound collection such that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized (step S24). More specifically, the second directional sound collection unit 24 uses the delayed versions of the signals collected by the plurality of microphones to perform sound collection such that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized. The signal collected by the second directional sound collection unit 24 is output to the specific sound detection unit 21.

The specific sound detection unit 21 detects the specific sound based on the signal collected by the second directional sound collection unit 24. The subsequent processing is the same as described above.

As shown in FIG. 9, a plurality of second directional sound collection units 24 may be provided in the acoustic signal processing device. In this case, the acoustic signal processing device is provided with the same number of specific sound detection units 21 as second directional sound collection units 24.

In this case, when a plurality of arrival directions are estimated by the direction estimation unit 22, the second directional sound collection units 24 operate so as to emphasize the respective estimated arrival directions, their outputs are input to the respective specific sound detection units 21, and detection of the specific sound is performed.

This makes it possible to assign priorities when the specific sound is detected by a plurality of the specific sound detection units 21.

[[Modification 3 of Second Embodiment]]
As illustrated in FIG. 10, in Modification 2 of the second embodiment, the acoustic signal processing device may further include an estimated frequency measurement unit 26 and a selection unit 27. In this case, the direction estimation unit 22 may be capable of estimating a plurality of directions simultaneously. That is, when sound from a noise source is present at the same time as the specific sound, the direction estimation unit 22 may be able to estimate the directions of both sound sources.

The processing of the estimated frequency measurement unit 26 and the selection unit 27 is the same as that described in Modification 1 of the first embodiment.

That is, the estimated frequency measurement unit 26 measures the frequency of the arrival directions estimated by the direction estimation unit 22 in a predetermined past time section (step S26). In other words, the estimated frequency measurement unit 26 measures how often each direction has been estimated within a fixed past period. Information on the measured frequency is output to the selection unit 27.

For example, if the time during which the output of the direction estimation unit 22 was direction θ in the past T seconds is A(θ) seconds, the estimated frequency of direction θ is given by their ratio D(θ) = A(θ)/T. The estimated frequency measurement unit 26 obtains this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker used for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a sound source is located in direction θ, the estimated frequency D(θ) takes a large value close to 1.

The selection unit 27 selects the arrival direction with the lowest frequency among the frequencies measured by the estimated frequency measurement unit 26 (step S27). For example, when the output of the direction estimation unit 22 contains two estimated directions, the selection unit 27 selects the one with the smaller estimated frequency D(θ). The arrival direction selected by the selection unit 27 at the time when the specific sound is detected by the specific sound detection unit 21 is output to the direction estimation unit 22 and is treated as the arrival direction estimated by the direction estimation unit 22.

Thereafter, the first directional sound collection unit 23 performs sound collection in the same manner as described above, such that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized.

[Third embodiment]
The acoustic signal processing device and method of the third embodiment detect voice sections as acoustic signal processing.

<Points of the third embodiment>
In the present embodiment, information on the usage environment (such as noise) is obtained more accurately by narrowing down what the user utters. For example, the user is required to utter a specific word (keyword) before starting to speak. The device is arranged so that only the speech of that specific word can be detected with high accuracy, and it is assumed that "that section is voice" and "the section before it is noise". The signals in the noise section and the voice section are then used to update the information for the "voice/non-voice" decision.

As a result, when determining the section of the target speech uttered afterwards, "noise" and "voice" information better matched to the actual usage environment can be used, and the accuracy of section detection improves.

Embodiments of the acoustic signal processing device and method are described below. The acoustic signal processing device is realized by a computer, for example a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer. Here, the case of realization on a computer (general-purpose machine) is described.

An example of the hardware configuration of the acoustic signal processing device is described below.

The acoustic signal processing device has: an input unit to which a keyboard, a pointing device, and the like can be connected; an output unit to which a liquid crystal display, a CRT (Cathode Ray Tube) display, and the like can be connected; a communication unit to which a communication device capable of communicating with the outside of the acoustic signal processing device (for example, a communication cable, a LAN card, a router, or a modem) can be connected; a CPU (Central Processing Unit) [which may be a DSP (Digital Signal Processor), and which may include a cache memory, registers, and the like]; memory in the form of RAM and ROM; an external storage device such as a hard disk, an optical disk, or a semiconductor memory; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the acoustic signal processing device may also be provided with a device (drive) that can read from and write to a storage medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), or a DVD (Digital Versatile Disc).

In addition, the acoustic signal processing device may be provided with a signal input unit, to which an acoustic signal pickup means (for example, a microphone) that receives sounds such as speech, music, and noise can be connected and which receives the (analog) signal obtained by the microphone, and a signal output unit, to which a sound output device (for example, a speaker) that outputs a reproduced signal as sound can be connected and which outputs the signal to be input to the speaker (the D/A-converted reproduced signal). In this case, a microphone is connected to the signal input unit and a speaker is connected to the signal output unit.

The external storage device of the acoustic signal processing device stores a program for voice section detection and the data required for processing by this program [the program is not limited to being stored in the external storage device; for example, it may be stored in a ROM, which is a read-only storage device]. Data obtained through processing by this program is stored as appropriate in the RAM, the external storage device, or the like. Hereinafter, a storage means that stores data, the addresses of its storage areas, and the like is simply referred to as a "○○ storage unit".

In this embodiment, the acoustic signal, which is a discrete signal, is stored in the main storage unit so that the signal of a section preceding, in time series, the voice section contained in the acoustic signal can be acquired. This storage may be temporary storage such as a buffer.

<Configuration of acoustic signal processing device>
FIG. 13 is a functional block diagram of the acoustic signal processing device according to the third embodiment, and FIG. 14 shows its processing flow.

The acoustic signal processing device includes a voice section detection unit 320 and a voice section detection information storage unit 330.

The acoustic signal processing device receives as inputs the time-series acoustic signal picked up by one microphone 310 and the output value of the specific voice section detection unit 340, detects at least one of the voice sections and the non-voice sections contained in the time-series acoustic signal, and outputs the detection result.

The specific voice section detection unit 340 detects that a predetermined sound (hereinafter also referred to as the "specific sound") has arrived and outputs information indicating the detection time of the specific sound. In the present embodiment, the specific sound is predetermined speech uttered by a person, for example the speech produced when a person utters a predetermined keyword. The specific voice section detection unit 340 can be implemented using a technique such as the "phrase spotting" of Reference 1.
(Reference 1) "Description of Sensory, Inc. Speech Technology", [online], 2010, [retrieved July 24, 2017], Internet <URL: http://www.sensory.co.jp/Parts/Docs/SensoryTechnologyJP1003B.pdf>
The information indicating the detection time of the specific sound is information indicating at least the time at which the specific sound (for example, the keyword) finished being uttered. It may be (1-i) the time at which the specific sound finished being uttered itself, (1-ii) the frame number of the time-series acoustic signal corresponding to the time at which the specific sound finished being uttered, (1-iii) information that indicates the end time by outputting a value indicating no detection (for example, "0") at frame times other than the time at which the specific sound finished being uttered and a value indicating detection (for example, "1") at the time at which the specific sound finished being uttered, or any other information indicating the time at which the specific sound finished being uttered. The information indicating the detection time of the specific sound may also include information indicating the time at which the specific sound began to be uttered. It may be (2-i) the start time and end time of the specific sound themselves, (2-ii) the frame numbers of the time-series acoustic signal corresponding to the start time and end time of the specific sound, (2-iii) information that indicates the end time by outputting a value indicating detection (for example, "1") from the start time to the end time of the specific sound and a value indicating no detection (for example, "0") at all other times, or any other information indicating the time at which the specific sound finished being uttered.
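As an illustration of format (2-iii), the per-frame 0/1 flags can be converted into a start frame and an end frame. The sketch below assumes the flags arrive as a Python list and that only the first detected run is of interest; the function name `flags_to_interval` is hypothetical.

```python
def flags_to_interval(flags):
    """Return (start_frame, end_frame) of the first run of 1s, or None.

    flags : per-frame detection flags, 1 while the specific sound is
            detected and 0 otherwise.
    """
    try:
        start = flags.index(1)
    except ValueError:
        return None  # the specific sound was never detected
    end = start
    # Extend the interval while the following frames are still flagged.
    while end + 1 < len(flags) and flags[end + 1] == 1:
        end += 1
    return start, end


# Hypothetical flag sequence: the keyword occupies frames 2 to 4.
interval = flags_to_interval([0, 0, 1, 1, 1, 0])
```

For format (1-iii), where only the final frame is flagged, the same function returns an interval whose start and end frames coincide.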

The processing performed by each unit is described below.

<Voice section detection information storage unit 330>
The voice section detection information storage unit 330 receives as inputs the information indicating the detection time of the specific sound and the time-series acoustic signal, obtains, frame by frame, the feature amount of the time-series acoustic signal corresponding to the specific sound voice section and the feature amount of the time-series acoustic signal corresponding to the non-voice section (S330), and outputs them. In every unit, including the voice section detection information storage unit 330, processing is performed frame by frame.

As shown in FIG. 15, the voice section detection information storage unit 330 includes a voice storage unit 331, a specific sound voice section calculation unit 332, and a feature amount calculation unit 333. The processing performed by each unit is described below.

(Voice storage unit 331)
The voice storage unit 331 receives and stores the time-series acoustic signal that is the target of voice section detection.

(Specific sound voice section calculation unit 332)
The specific sound voice section calculation unit 332 receives information indicating the detection time of the specific sound as input, takes the section of the time-series acoustic signal that is estimated, based on the detection time, to correspond to the specific sound as the specific sound voice section, determines the section of the time-series acoustic signal that is estimated, based on the detection time, not to correspond to the specific sound to be a non-voice section, and outputs information indicating the specific sound voice section and information indicating the non-voice section. For example, the t₁ seconds before the detection time of the specific sound (in this example, the time at which the specific sound finished being uttered) are taken as the specific sound voice section, and the t₂ seconds before the specific sound voice section are determined to be the non-voice section (see FIG. 16).

For example, when the information indicating the detection time of the specific sound contains only the frame time at which utterance of the specific sound ended (denoted t), t₁ and t₂ are each set to predetermined values in advance, and the specific sound voice section (from t-t₁ to t) and the non-voice section (from t-t₁-t₂ to t-t₁) are obtained from the detection time. The average time taken to utter the specific sound, or a similar value, may be used as t₁. When the information indicating the detection time contains both the time at which utterance of the specific sound began and the time at which it ended (denoted t), the start time is taken as t-t₁, and the specific sound voice section runs from the start time t-t₁ to the end time t.
In that case, t₂ is set to a predetermined value in advance, and the non-voice section (from t-t₁-t₂ to t-t₁) is obtained from t₂ and the start time t-t₁.
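The two cases above can be sketched in a few lines of Python. This is only an illustration: the names `Section` and `sections_from_detection`, and the use of seconds as the time unit, are assumptions and do not appear in the publication.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Section:
    start: float  # seconds
    end: float    # seconds


def sections_from_detection(
    t_end: float,
    t2: float,
    t1_default: float,
    t_start: Optional[float] = None,
) -> Tuple[Section, Section]:
    """Return (specific sound voice section, non-voice section)."""
    # If the detector also reports when the utterance began, t1 is its
    # measured length; otherwise the preset t1 is used (for example an
    # average utterance duration).
    t1 = t_end - t_start if t_start is not None else t1_default
    if t1 <= 0 or t2 <= 0:
        raise ValueError("t1 and t2 must be positive")
    voice = Section(start=t_end - t1, end=t_end)
    non_voice = Section(start=t_end - t1 - t2, end=t_end - t1)
    return voice, non_voice
```

With `t_end=10.0`, `t2=2.0` and a preset `t1_default=1.0`, the voice section is 9.0 to 10.0 and the non-voice section is 7.0 to 9.0.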

(Feature amount calculation unit 333)
The feature amount calculation unit 333 receives the information indicating the specific sound voice section and the information indicating the non-voice section from the specific sound voice section calculation unit 332, and receives the time-series sound signal subject to voice section detection that is stored in the voice storage unit 331. It then associates the time-series sound signal with the specific sound voice section and with the non-voice section, calculates a voice section feature amount from the part of the signal corresponding to the specific sound voice section, calculates a non-voice section feature amount from the part corresponding to the non-voice section, and outputs both feature amounts. A logarithmic mel spectrum, cepstrum coefficients, or the like can be used as the feature amount. However, it is preferable to use an acoustic feature amount other than the one (fundamental frequency) used by the second acoustic signal analysis unit 322. Any method may be used to calculate the feature amount; for example, the method described in Reference 4 is used.
(Reference 4) JP 2009-63700 A

<Voice section detection unit 320>
The voice section detection unit 320 receives the time-series sound signal from the microphone 310 and receives the voice section feature amount and the non-voice section feature amount from the feature amount calculation unit 333. It obtains a voice parameter indicating the characteristics of the voice section from the voice section feature amount, obtains a non-voice parameter indicating the characteristics of the non-voice section from the non-voice section feature amount, detects at least one of a voice section and a non-voice section in the time-series sound signal using the voice parameter and the non-voice parameter (S320), and outputs the detection result.

For example, the voice section detection unit 320 obtains, from the voice section feature amount, a voice parameter that is a parameter of the acoustic model used to estimate voice sections, and obtains, from the non-voice section feature amount, a non-voice parameter that is a parameter of the acoustic model used to estimate non-voice sections.

For example, the voice section detection device of Reference 4 can be used as the voice section detection unit 320. In this case, the voice parameter is a parameter of the voice GMM, and the non-voice parameter is a parameter of the non-voice GMM.

As shown in FIG. 17, the voice section detection unit 320 includes: a first acoustic signal analysis unit 321 that performs probability calculation on the input time-series sound signal using parallel Kalman filters and parallel Kalman smoothers; a second acoustic signal analysis unit 322 that performs probability calculation using the ratio of the periodic component to the aperiodic component of the time-series sound signal; a weight calculation unit 323 that calculates a weight for each probability; a voice state / non-voice state synthesis probability ratio calculation unit 324 that uses the calculated weights to calculate the synthesis probability that the time-series sound signal belongs to the voice state and the synthesis probability that it belongs to the non-voice state, and obtains their ratio; and a voice section estimation unit 325 that discriminates voice from non-voice based on that ratio. The components other than the first acoustic signal analysis unit 321 perform the same processing as in Reference 4, so their description is omitted.

The time-series sound signal input to the first acoustic signal analysis unit 321 is a sound signal sampled at a sampling rate of, for example, 8,000 Hz and converted into a discrete signal. This sound signal is a sound in which a noise signal is superimposed on the speech signal that is the target signal. Hereinafter, the sound signal is called the "input signal", the speech signal "clean speech", and the noise signal "noise".

The voice section detection unit 320 receives the input signal, the voice section feature amount, and the non-voice section feature amount, and outputs a voice section detection result. The voice section detection result takes the value 1 if the frame of the sound signal belongs to the voice state and 0 if it belongs to the non-voice state. The voice section detection unit 320 may instead output a signal obtained by multiplying the input signal by the value of the voice section detection result. In that case, the values of the input signal are retained in frames belonging to the voice state, and all signal values are replaced with 0 in frames belonging to the non-voice state.
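The optional masked output can be illustrated as follows. This is a minimal sketch that assumes non-overlapping frames of `frame_len` samples; the function name `apply_vad_mask` and the framing are assumptions, since the publication does not specify the frame layout.

```python
from typing import List, Sequence


def apply_vad_mask(
    signal: Sequence[float], vad: Sequence[int], frame_len: int
) -> List[float]:
    """Multiply every sample by the 0/1 detection result of its frame."""
    out = []
    for i, x in enumerate(signal):
        frame = i // frame_len
        # Samples beyond the last detection result are treated as non-voice.
        flag = vad[frame] if frame < len(vad) else 0
        out.append(x * flag)
    return out
```

For a four-sample signal with `frame_len=2` and detection results `[1, 0]`, the first two samples pass through unchanged and the last two become 0.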

<First acoustic signal analysis unit 321>
As shown in FIG. 18, the first acoustic signal analysis unit 321 receives the input signal, the voice section feature amount, and the non-voice section feature amount. It includes a feature amount calculation unit 3211 for extracting the acoustic feature amount used for voice section detection, and a probability estimation unit 3212 for estimating probability model parameters and calculating the probability of the input signal using the probability model constructed from the estimated parameters.

(Feature amount calculation unit 3211)
The feature amount calculation unit 3211 calculates the feature amount of the input signal by the same method as the feature amount calculation unit 333 and outputs it. For example, it calculates and outputs a vector G_t = {g_{t,0}, …, g_{t,φ}, …, g_{t,23}} whose elements are a 24-dimensional logarithmic mel spectrum. The vector G_t represents the acoustic feature amount of the frame whose extraction start time is t, and φ is the element index of the vector. Hereinafter, t is called the frame time.

(Probability estimation unit 3212)
The 24-dimensional logarithmic mel spectrum output by the feature amount calculation unit 3211 is input to the probability estimation unit 3212. The probability estimation unit 3212 applies parallel nonlinear Kalman filters and parallel Kalman smoothers to the input frames to estimate the noise parameters. Using the estimated noise parameters, it generates probability models for non-voice (noise + silence) and voice (noise + clean speech), and calculates the probability obtained when the logarithmic mel spectrum is input to each probability model.

As shown in FIG. 19, the probability estimation unit 3212 includes a forward estimation unit 3212-1, a backward estimation unit 3212-2, a GMM (Gaussian Mixture Model) storage unit 3212-3, and a parameter storage unit 3212-4. The backward estimation unit 3212-2 performs the same processing as in Reference 4, so its description is omitted.

The GMM storage unit 3212-3 stores a silence GMM and a clean speech GMM, which are acoustic models of a silence signal and a clean speech signal prepared in advance. Hereinafter, the silence GMM and the clean speech GMM are simply referred to as GMMs. Methods for constructing a GMM are well known, so their description is omitted. Each GMM contains a plurality of normal distributions (for example, 32). Each normal distribution is defined by a mixture weight w_{j,k}, a mean μ_{S,j,k,φ}, and a variance Σ_{S,j,k,φ}, where j is the type of GMM (j=0: silence GMM, j=1: clean speech GMM) and k is the index of the normal distribution. These parameters are input to the forward estimation unit 3212-1 and the backward estimation unit 3212-2.
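To make the role of the stored parameters concrete, the sketch below evaluates the log-likelihood of one feature vector under a single GMM given its mixture weights, means, and variances. It assumes diagonal covariances (one variance per element φ), which the per-element indexing suggests but the text does not state. It does not include the noise model that the probability estimation unit 3212 combines with these GMMs, and the function name `gmm_log_likelihood` is hypothetical.

```python
import math
from typing import Sequence


def gmm_log_likelihood(
    g: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """log p(g) under a diagonal-covariance Gaussian mixture model."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Sum of per-element Gaussian log densities for component k.
        log_norm = 0.0
        for x, m, v in zip(g, mu, var):
            log_norm += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_terms.append(math.log(w) + log_norm)
    # Log-sum-exp over components, shifted by the maximum for stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

Evaluating the same vector under the j=0 and j=1 models and comparing the two values is the basic ingredient of a voice/non-voice likelihood ratio.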

The parameter storage unit 3212-4 includes an initial noise model estimation buffer and a noise model estimation buffer.

[Forward estimation unit 3212-1]
The processing in the forward estimation unit 3212-1 differs from that in Reference 4.

In Reference 4, the forward estimation unit obtains the noise model parameters ^N_{t,j,k,φ} and ^Σ_{N,t,j,k,φ} by sequential updating from the start of processing, but it updates the parameters of the non-voice and voice GMMs without determining whether the input sound is voice or non-voice (noise). In the present embodiment, by contrast, the non-voice section and the voice section are known, so this information is used more actively to update the parameters. That is, the parameters of the non-voice GMM are updated using the feature amounts of the non-voice section, and the parameters of the voice GMM are updated using the feature amounts of the voice section.
An example of the processing is given below.

First, the forward estimation unit 3212-1 updates the parameters of the non-voice GMM (j=0) using the feature amounts g_{t-t_1-t_2,φ}, …, g_{t-t_1,φ} of frame times t-t₁-t₂ to t-t₁, which correspond to the non-voice section. Here, the subscripts t_1 and t_2 denote t₁ and t₂, respectively.

The forward estimation unit 3212-1 stores, in the initial noise model estimation buffer, q frames of the non-voice section feature amounts (in this example, the logarithmic mel spectrum g_{t,φ}), namely g_{t-t_1-t_2,φ}, …, g_{t-t_1-t_2-1+q-1,φ}. Here, q is an integer of 1 or more that does not exceed the length t₂ of the non-voice section; for example, q=10.

The forward estimation unit 3212-1 retrieves the q frames of feature amounts g_{t-t_1-t_2,φ}, …, g_{t-t_1-t_2-1+q-1,φ} from the initial noise model estimation buffer, estimates the initial noise model parameters N^init_φ and Σ^init_{N,φ} by the following equations, and stores them in the noise model estimation buffer.
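The equations referred to above are not reproduced in this text. As a purely illustrative stand-in, the sketch below assumes that the initial noise model is the per-element sample mean and sample variance of the q buffered frames. That estimator, and the name `initial_noise_model`, are assumptions and should not be read as the formulas of the publication or of Reference 4.

```python
from typing import List, Sequence, Tuple


def initial_noise_model(
    frames: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-element mean and variance over the q buffered frames (assumed form)."""
    q = len(frames)
    if q == 0:
        raise ValueError("need at least one frame")
    dims = len(frames[0])
    # Assumed N^init_phi: average of element phi over the q frames.
    mean = [sum(f[d] for f in frames) / q for d in range(dims)]
    # Assumed Sigma^init_{N,phi}: average squared deviation from that mean.
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / q for d in range(dims)]
    return mean, var
```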

It then updates the parameters of the non-voice GMM (j=0) using the feature amounts g_{t-t_1-t_2+q,φ}, …, g_{t-t_1,φ} of frame times t-t₁-t₂+q to t-t₁. The method and formulas for updating the parameters of the non-voice GMM are the same as in Reference 4.

Next, the forward estimation unit 3212-1 updates the parameters of the voice GMM (j=1) using the feature amounts g_{t-t_1+1,φ}, …, g_{t,φ} of frame times t-t₁+1 to t, which correspond to the voice section. The parameters updated using the last frame of the non-voice section are used as the initial parameters of the voice section. Starting from these, the parameters of the voice GMM (j=1) are updated using the feature amounts g_{t-t_1+1,φ}, …, g_{t,φ}. The method and formulas for updating the parameters of the voice GMM are the same as in Reference 4.

From frame time t onward, the parameters of the voice and non-voice GMMs are updated using the feature amounts of the input signal, as in the conventional technique.

Starting from the non-voice GMM parameters updated using the feature amounts of the non-voice section and the voice GMM parameters updated using the feature amounts of the voice section, the voice section detection unit 320 updates the parameters of the voice and non-voice GMMs from frame time t onward using the feature amounts of the input signal, and discriminates voice from non-voice using the resulting parameters. The discrimination accuracy can therefore be improved compared with the conventional technique, which updates the parameters of the non-voice and voice GMMs without determining whether the input is voice or non-voice (noise).

The processing described above may be performed only the first time the voice section feature amount and the non-voice section feature amount are received from the feature amount calculation unit 333, or it may be performed every time they are received. When it is performed every time, all of the processing may be repeated each time, including (a) the process of obtaining the initial noise model parameters N^init_φ and Σ^init_{N,φ} and (b) the process of using the parameters updated with the last frame of the non-voice section as the initial parameters of the voice section. Alternatively, in the second and subsequent rounds, processes (a) and (b) may be skipped and the parameters held at the time the feature amounts are received may be used as they are: the parameters of the non-voice GMM (j=0) are updated using the feature amounts g_{t-t_1-t_2,φ}, …, g_{t-t_1,φ} of frame times t-t₁-t₂ to t-t₁, which correspond to the non-voice section, and the parameters of the voice GMM (j=1) are updated using the feature amounts g_{t-t_1+1,φ}, …, g_{t,φ} of frame times t-t₁+1 to t, which correspond to the voice section.

<Effect>
With the above configuration, the result of keyword detection on a specific utterance of the target person (user) can be used to obtain more accurate information about the surrounding acoustic environment, including the target voice, which makes the signal processing for voice section detection more robust. In particular, even when the voice to be recognized and the noise have similar characteristics, at least one of the voice section and the non-voice section can be detected with higher accuracy than before.

Note that the single microphone 310 and the specific voice section detection unit 340 may be part of the acoustic signal processing device. In the present embodiment, a GMM is used as the acoustic model for estimating voice sections and non-voice sections, but another acoustic model such as an HMM (Hidden Markov Model) may be used. In that case as well, the voice parameter and the non-voice parameter are obtained from the voice section feature amount and the non-voice section feature amount, respectively, as in the present embodiment.

<First modification of the third embodiment>
The description focuses on the differences from the third embodiment.

In the third embodiment, a logarithmic mel spectrum, cepstrum coefficients, or the like are used as the feature amount, but other feature amounts may be used. This modification considers the simpler case in which the level of the sound is used for the determination.

In this modification, the average power is used as the feature amount. The feature amount calculation unit 333 therefore calculates the average power of the time-series sound signal corresponding to the specific sound voice section and outputs it as the voice section feature amount, and calculates the average power of the time-series sound signal corresponding to the non-voice section and outputs it as the non-voice section feature amount.
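A minimal sketch of this feature calculation is given below. It assumes that average power means the mean of the squared sample values and that sections are given as (start, end) times in seconds; both points, and the names `average_power` and `section_features`, are assumptions for illustration.

```python
from typing import Sequence, Tuple


def average_power(samples: Sequence[float]) -> float:
    """Mean of the squared sample values."""
    if not samples:
        raise ValueError("empty section")
    return sum(x * x for x in samples) / len(samples)


def section_features(
    signal: Sequence[float],
    sample_rate: int,
    voice: Tuple[float, float],
    non_voice: Tuple[float, float],
) -> Tuple[float, float]:
    """Return (Pv, Pn) for sections given as (start_sec, end_sec)."""
    def cut(section: Tuple[float, float]) -> Sequence[float]:
        # Convert section boundaries from seconds to sample indices.
        start = int(round(section[0] * sample_rate))
        end = int(round(section[1] * sample_rate))
        return signal[start:end]

    return average_power(cut(voice)), average_power(cut(non_voice))
```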

<Voice section detection unit 320>
The voice section detection unit 320 receives the time-series sound signal subject to voice section detection that is stored in the voice storage unit 331, and receives the voice section feature amount and the non-voice section feature amount from the feature amount calculation unit 333. It obtains a voice parameter indicating the characteristics of the voice section from the voice section feature amount, obtains a non-voice parameter indicating the characteristics of the non-voice section from the non-voice section feature amount, detects at least one of a voice section and a non-voice section in the time-series sound signal using the voice parameter and the non-voice parameter (S320), and outputs the detection result.

As shown in FIG. 20, the voice section detection unit 320 includes a voice power calculation unit 326, a voice/non-voice determination unit 327, a non-voice level storage unit 328, and a voice level storage unit 329.

The voice power calculation unit 326 receives the time-series sound signal subject to voice section detection that is stored in the voice storage unit 331, calculates the average power P(n) of the time-series sound signal for each frame n, and outputs it.

For example, one conceivable method is to determine that a section is a voice section when it satisfies
P(n) > γV and P(n) > δN
Here, n is an index representing the frame time; N and V are the power threshold of the non-voice section and the power threshold of the voice section, stored in the non-voice level storage unit 328 and the voice level storage unit 329, respectively; γ is a real number from 0 to 1; and δ is a real number of 1 or more. In other words, a frame is determined to be a voice section when its power is greater than a value (γV) reasonably close to the signal level of the voice section and also greater than a value (δN) sufficiently larger than the signal level of the non-voice section (for example, noise). This method does not work correctly when the pre-stored voice and non-voice information (V, N) differs from the actual signal levels of the voice and non-voice sections. It is also conceivable to update V and N sequentially according to the time-series sound signal, but because the update is made without knowing which sections are voice or non-voice, there is a risk that the values are updated in the wrong direction.
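The fixed-threshold rule discussed above can be written down directly. The sketch assumes non-overlapping frames and uses arbitrary example values for γ and δ; the function names `frame_powers` and `is_voice` are hypothetical.

```python
from typing import List, Sequence


def frame_powers(signal: Sequence[float], frame_len: int) -> List[float]:
    """Average power P(n) of each non-overlapping frame n."""
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[n * frame_len:(n + 1) * frame_len]) / frame_len
        for n in range(n_frames)
    ]


def is_voice(
    p: float, v: float, n: float, gamma: float = 0.5, delta: float = 2.0
) -> bool:
    """Fixed-threshold rule: P(n) > gamma * V and P(n) > delta * N."""
    return p > gamma * v and p > delta * n
```

Because V and N are fixed here, a frame that is loud relative to the stored noise level but quiet relative to the stored voice level is rejected, which is the weakness the modification addresses.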

In this modification, the power thresholds V and N are changed using the voice section feature amount (the average power of the voice section) and the non-voice section feature amount (the average power of the non-voice section).

The voice/non-voice determination unit 327 retrieves the power thresholds N and V from the non-voice level storage unit 328 and the voice level storage unit 329, respectively, receives the average power P(n) from the voice power calculation unit 326, and receives from the feature amount calculation unit 333 the average power Pv of the time-series sound signal corresponding to the specific sound voice section and the average power Pn of the time-series sound signal corresponding to the non-voice section.

The voice/non-voice determination unit 327 replaces the power thresholds V and N with power thresholds V' and N' that take the average powers Pv and Pn into account, according to the following equations.
N' = (1-α)N + αPn
V' = (1-β)V + βPv
Here, α and β are parameters (0<α<1, 0<β<1) that determine the contribution of the detected non-voice and voice sections. When
P(n) > γV' and P(n) > δN'
is satisfied, the voice/non-voice determination unit 327 detects the section corresponding to frame n as a voice section; otherwise, it detects the section corresponding to frame n as a non-voice section. It then outputs the detection result.
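The threshold update and the subsequent decision can be sketched as follows. The equations are taken from the text; the function names `adapt_thresholds` and `detect`, and the default parameter values, are assumptions chosen only for the example.

```python
from typing import List, Sequence, Tuple


def adapt_thresholds(
    v: float, n: float, pv: float, pn: float,
    alpha: float = 0.5, beta: float = 0.5,
) -> Tuple[float, float]:
    """Return (V', N') blended with the measured section powers Pv and Pn."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    n_new = (1.0 - alpha) * n + alpha * pn  # N' = (1-alpha)N + alpha*Pn
    v_new = (1.0 - beta) * v + beta * pv    # V' = (1-beta)V + beta*Pv
    return v_new, n_new


def detect(
    powers: Sequence[float], v_new: float, n_new: float,
    gamma: float = 0.5, delta: float = 2.0,
) -> List[int]:
    """1 for frames with P(n) > gamma*V' and P(n) > delta*N', else 0."""
    return [1 if p > gamma * v_new and p > delta * n_new else 0 for p in powers]
```

With V=1.0, N=0.1, Pv=3.0, Pn=0.3 and α=β=0.5, the thresholds move to V'=2.0 and N'=0.2, so a frame now needs a power above 1.0 (γV') and above 0.4 (δN') to count as voice.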

In this modification, V' corresponds to the voice parameter indicating the characteristics of the voice section, and N' corresponds to the non-voice parameter indicating the characteristics of the non-voice section.

<Effect>
With the above configuration, the level determination can better reflect the actual conditions, and the same effect as in the third embodiment can be obtained.

<Second modification of the third embodiment>
The description focuses on the differences from the third embodiment.

FIG. 13 is a functional block diagram of the acoustic signal processing device according to the third embodiment, and FIG. 14 shows its processing flow.

The acoustic signal processing device includes the voice section detection unit 320, the voice section detection information storage unit 330, and a preprocessing unit 350.

<Preprocessing unit 350>
The preprocessing unit 350 receives the time-series sound signal as input, performs processing that emphasizes the voice contained in the time-series sound signal (voice enhancement processing) (S350), and outputs the enhanced time-series sound signal. Any method may be used for the voice enhancement processing; for example, the noise suppression method described in Reference 2 is used.
(Reference 2) JP 2009-110011 A

<Effect>
With the above configuration, the same effect as in the third embodiment can be obtained. Furthermore, performing the subsequent processing (S330, S320) on the time-series sound signal that has undergone voice enhancement improves the detection accuracy.

<Third modification of the third embodiment>
The description focuses on the differences from the third embodiment.

The acoustic signal processing device receives as input M time-series sound signals picked up by M microphones 310-m (m=1,2,…,M, where M is an integer of 2 or more) and L output values (L is an integer of 2 or more) of the specific voice section detection unit 340, detects at least one of a voice section and a non-voice section included in the time-series sound signals, and outputs the detection result.

FIG. 21 is a functional block diagram of the acoustic signal processing device according to the third modification, and FIG. 22 shows its processing flow.

The acoustic signal processing device includes a beamforming unit 360, the voice section detection unit 320, and the voice section detection information storage unit 330.

<Beamforming unit 360>
The beamforming unit 360 receives the M time-series sound signals as input, converts them into L time-series signals whose directivity is enhanced toward L respective directions (these are time-series sound signals, for example beamforming output signals) (S360), and outputs them to the specific voice section detection unit 340, the voice section detection information storage unit 330, and the voice section detection unit 320. For example, the signals are converted into L time-series beamforming output signals using a beamforming technique. Any beamforming technique may be used; for example, the method described in Reference 3 is used.
(Reference 3) JP 2017-107141 A
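Since any beamforming technique may be used, the conversion from M microphone signals to L directional signals can be illustrated with a plain delay-and-sum beamformer. This is a substitute chosen for brevity, not the method of Reference 3. The integer sample delays per look direction (`steering`), the equal channel lengths, and the function names are all assumptions.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Average M channels after advancing channel m by delays[m] samples."""
    length = len(channels[0])
    out = [0.0] * length
    for ch, d in zip(channels, delays):
        for i in range(length):
            j = i + d
            # Samples shifted in from outside the signal are treated as zero.
            if 0 <= j < length:
                out[i] += ch[j]
    return [x / len(channels) for x in out]


def beamform(
    channels: Sequence[Sequence[float]], steering: Sequence[Sequence[int]]
) -> List[List[float]]:
    """One output per look direction: L delay sets give L time-series signals."""
    return [delay_and_sum(channels, delays) for delays in steering]
```

When the delays match the arrival-time differences of a source, its samples add coherently in that output and only partially in the others, which is what gives each of the L signals its directivity.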

The specific voice section detection unit 340 detects the arrival of the specific sound in each of the L time-series signals and outputs information indicating the detection time of the specific sound to the voice section detection information storage unit 330. The specific sound is assumed to be detected in at least one of the L time-series signals. The information indicating the detection time of the specific sound includes information indicating the one or more channels in which it was detected and information indicating the one or more detection times corresponding to those channels. The information indicating each detection time is as described in the third embodiment.

<Voice section detection information storage unit 330>
The voice section detection information storage unit 330 receives the information indicating the detection time of the specific sound and the L time-series signals as input, obtains the voice section feature amount and the non-voice section feature amount of each channel in which the specific sound was detected (S330), and outputs them. The feature amounts are obtained for all channels in which the specific sound was detected.

<Voice section detection unit 320>
The voice section detection unit 320 receives the L time-series signals and receives, from the feature amount calculation unit 333, the voice section feature amounts and non-voice section feature amounts of the channels in which the specific sound was detected. It obtains one voice parameter indicating the characteristics of the voice section from the voice section feature amounts of all channels in which the specific sound was detected, and one non-voice parameter indicating the characteristics of the non-voice section from the non-voice section feature amounts of all those channels. Using the voice parameter and the non-voice parameter, it detects at least one of a voice section and a non-voice section in each of the L time-series signals (S320) and outputs the detection result. The detection method is as described in the third embodiment. In this modification, one (common) voice parameter and one (common) non-voice parameter are used for all L time-series signals.
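The text does not say how the feature amounts of several channels are combined into one common parameter. As one possible reading, the sketch below uses the average-power feature of the first modification and simply averages Pv and Pn over the channels in which the specific sound was detected. The averaging rule and the name `shared_features` are assumptions.

```python
from typing import Dict, Tuple


def shared_features(features: Dict[int, Tuple[float, float]]) -> Tuple[float, float]:
    """Pool per-channel (Pv, Pn) pairs into one common pair by averaging.

    `features` maps the index of each channel in which the specific sound
    was detected to its (Pv, Pn) pair.
    """
    if not features:
        raise ValueError("the specific sound must be detected in at least one channel")
    count = len(features)
    pv = sum(f[0] for f in features.values()) / count
    pn = sum(f[1] for f in features.values()) / count
    return pv, pn
```

The pooled pair would then feed a single threshold update whose result is applied to all L signals, including channels in which the specific sound was not detected.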

<Effect>
With such a configuration, the same effect as in the third embodiment can be obtained. The beamforming unit 360 may be a separate device, in which case the acoustic signal processing device takes the L time-series signals as input. Alternatively, the beamforming unit 360 may be omitted, and the input may be L time-series acoustic signals picked up by L directional microphones 310-m (m = 1, 2, ..., L, where L is an integer of 2 or more) whose directivity is enhanced toward L respective directions.

<Fourth Modification of Third Embodiment>
The following description focuses on the differences from the third modification.

<Voice section detection unit 320>
The voice section detection unit 320 receives the L time-series signals and receives, from the feature amount calculation unit 333, the voice section feature amount and the non-voice section feature amount of the channel in which the specific sound is detected. The voice section detection unit 320 obtains one voice parameter indicating a feature of the voice section from the voice section feature amount of one channel in which the specific sound is detected, and obtains one non-voice parameter indicating a feature of the non-voice section from the non-voice section feature amount of that channel. Using the voice parameter and the non-voice parameter obtained for each channel in which the specific sound is detected, it detects at least one of the voice section and the non-voice section from the time-series acoustic signal in which the specific sound is detected (S320) and outputs the detection result. The detection method is as described in the third embodiment.

In this modification, L voice parameters and L non-voice parameters, corresponding respectively to the L time-series signals, are used. The voice section detection unit 320 receives the voice section feature amount and the non-voice section feature amount of the channel in which the specific sound is detected, and obtains the voice parameter and the non-voice parameter of that channel only. For a channel in which the specific sound has not been detected, no voice parameter or non-voice parameter is obtained; the parameters for that channel are obtained at the time the specific sound is detected in it.
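
The per-channel handling described above can be sketched as follows. The sketch assumes, for illustration only, scalar frame features, a mean as each parameter, and a nearest-parameter decision; the class and method names are hypothetical and the actual detection method is that of the third embodiment.

```python
from statistics import fmean

class PerChannelParameters:
    """Holds one voice and one non-voice parameter per channel (L channels)."""

    def __init__(self, num_channels):
        # None means the specific sound has not yet been detected on that
        # channel, so no parameters exist for it.
        self.voice = [None] * num_channels
        self.nonvoice = [None] * num_channels

    def update(self, channel, voice_feats, nonvoice_feats):
        """Called only for a channel in which the specific sound was detected."""
        self.voice[channel] = fmean(voice_feats)        # assumed: mean feature
        self.nonvoice[channel] = fmean(nonvoice_feats)  # assumed: mean feature

    def detect(self, channel, frame_feats):
        """Return per-frame voice flags, or None if the channel has no parameters."""
        v, n = self.voice[channel], self.nonvoice[channel]
        if v is None or n is None:
            return None
        return [abs(f - v) < abs(f - n) for f in frame_feats]

params = PerChannelParameters(num_channels=3)
params.update(1, voice_feats=[10.0, 12.0], nonvoice_feats=[1.0, 3.0])
flags_ch1 = params.detect(1, [2.5, 10.0])  # channel 1 has its own parameters
flags_ch0 = params.detect(0, [2.5, 10.0])  # channel 0 has none yet
```

Keeping the parameters separate lets each channel reflect its own acoustic conditions, at the cost of having no parameters for a channel until the specific sound is detected in it.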

<Effect>
With such a configuration, the same effect as in the third embodiment can be obtained, and in addition a detailed voice parameter and non-voice parameter can be obtained for each channel.

[Supplement]
It can be said that the acoustic signal processing device includes an acoustic signal processing unit that receives, as input, an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal obtained by removing the acoustic signal corresponding to the specific sound from the input acoustic signal as a noise acoustic signal, and performs acoustic signal processing that associates the noise acoustic signal with the acoustic signal corresponding to the specific sound.

Alternatively, it can be said that the acoustic signal processing device includes an acoustic signal processing unit that receives, as input, an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal corresponding to the specific sound as a target acoustic signal, and performs acoustic signal processing that associates the target acoustic signal with the acoustic signal obtained by removing the target acoustic signal from the input acoustic signal.

Alternatively, it can be said that the acoustic signal processing device includes an acoustic signal processing unit that receives, as input, an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal obtained by removing the acoustic signal corresponding to the specific sound from the input acoustic signal as a noise acoustic signal, treats the acoustic signal corresponding to the specific sound as a target acoustic signal, and performs acoustic signal processing that associates the target acoustic signal with the noise acoustic signal.

One example of the acoustic signal processing unit is the third directional sound collection unit 52 of Modification 4 of the first embodiment. In this case, the target acoustic signal is the signal of the sound from the arrival direction read from the direction storage unit 13, and the noise acoustic signal is the signal of the sound from the arrival direction read from the noise direction storage unit 51.

Another example of the acoustic signal processing unit is the combination of the voice section detection information storage unit 330 and the voice section detection unit 320 of the third embodiment. In this case, the target acoustic signal is the time-series acoustic signal corresponding to the voice section of the specific sound, and the noise acoustic signal is the time-series acoustic signal corresponding to the non-voice section.
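
The division into a target acoustic signal and a noise acoustic signal described in this supplement can be sketched as follows for the time-series case. The sketch assumes that the specific sound occupies a single contiguous range of samples whose start and end indices are already known from the detection; the function name and its parameters are hypothetical.

```python
def split_by_specific_sound(signal, start, end):
    """Split a time-series signal around the detected specific sound.

    signal: list of samples.
    start, end: assumed sample indices [start, end) of the specific sound.
    Returns (target, noise), where noise is the input with the portion
    corresponding to the specific sound removed.
    """
    target = signal[start:end]
    noise = signal[:start] + signal[end:]
    return target, noise

samples = list(range(10))
target, noise = split_by_specific_sound(samples, start=3, end=6)
```

The two outputs are kept as a pair so that later processing can use them in association with each other, as the supplement describes.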

[Program and recording medium]
When the processing of each unit of each acoustic signal processing device is implemented by a computer, the processing content of the functions that each unit should have is described by a program. Executing this program on the computer implements the processing of each unit on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The processing of each unit may be implemented by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

In addition, it goes without saying that modifications can be made as appropriate without departing from the spirit of the present invention.

Claims

An acoustic signal processing device comprising:
a direction estimation unit that estimates an arrival direction of sound from signals collected by a plurality of microphones such that a direction closer to the arrival direction estimated when a specific sound, which is a predetermined sound, was detected is more likely to be estimated as the arrival direction; and
a first directional sound collection unit that collects sound such that the sound from the arrival direction estimated by the direction estimation unit is emphasized,
wherein the direction estimation unit, when it estimates a plurality of arrival directions at the time the specific sound is detected, selects the arrival direction of the specific sound from the plurality of arrival directions by using arrival directions estimated in a predetermined past section.

The acoustic signal processing device according to claim 1,
wherein the selection is performed such that the arrival direction of a sound source that emitted sound continuously in the predetermined past section is not selected as the arrival direction of the specific sound.

The acoustic signal processing device according to claim 1,
wherein the selection is performed such that an arrival direction that was estimated with high frequency in the predetermined past section is not selected as the arrival direction of the specific sound.

The acoustic signal processing device according to claim 1,
wherein the selection is performed such that the arrival direction that was estimated least frequently in the predetermined past section is selected as the arrival direction of the specific sound.

The acoustic signal processing device according to any one of claims 1 to 4,
wherein the first directional sound collection unit collects sound such that sound from a direction closer to the arrival direction estimated to be that of the specific sound is emphasized more strongly.

An acoustic signal processing method comprising:
a direction estimation step in which a direction estimation unit estimates an arrival direction of sound from signals collected by a plurality of microphones such that a direction closer to the arrival direction estimated when a specific sound, which is a predetermined sound, was detected is more likely to be estimated as the arrival direction; and
a first directional sound collection step in which a first directional sound collection unit collects sound such that the sound from the arrival direction estimated by the direction estimation unit is emphasized,
wherein, in the direction estimation step, when a plurality of arrival directions are estimated at the time the specific sound is detected, the arrival direction of the specific sound is selected from the plurality of arrival directions by using arrival directions estimated in a predetermined past section.

A computer-readable program for causing a computer to function as each unit of the acoustic signal processing device according to claim 1.
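
As an illustration only, the selection recited in the claims above, in which the arrival direction estimated least frequently in the predetermined past section is selected as the arrival direction of the specific sound, can be sketched as follows. The sketch assumes that directions are angles in degrees grouped into fixed-width bins before counting; the bin width, the function name, and the parameter names are hypothetical and are not taken from the claims.

```python
from collections import Counter

def select_specific_sound_direction(candidates, past_directions, bin_width=10):
    """Pick the candidate direction that was estimated least often in the past.

    candidates: arrival directions (degrees) estimated when the specific
        sound was detected.
    past_directions: arrival directions (degrees) estimated in the
        predetermined past section.
    bin_width: assumed width in degrees used to group nearby directions.
    """
    def to_bin(direction):
        return int(direction // bin_width)

    # Count how often each direction bin appeared in the past section.
    counts = Counter(to_bin(d) for d in past_directions)
    # A bin that never appeared counts as zero, so it is preferred.
    return min(candidates, key=lambda d: counts[to_bin(d)])

past = [31, 29, 33, 118, 30]   # a source near 30 degrees was active repeatedly
selected = select_specific_sound_direction([30, 120], past)
```

A direction that was active throughout the past section, such as a continuously sounding source, receives a high count and is therefore not chosen.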