JP2019028300A - Acoustic signal processing apparatus, method and program

Info

Publication number: JP2019028300A
Application number: JP2017148354A
Authority: JP
Inventors: Shoichiro Saito (齊藤 翔一郎); Kazunori Kobayashi (小林 和則); Hiroaki Ito (伊藤 弘章); Noboru Harada (原田 登); Takuya Higuchi (樋口 卓哉); Akiko Araki (荒木 章子); Keisuke Kinoshita (木下 慶介); Nobutaka Ito (伊藤 信貴); Tomohiro Nakatani (中谷 智広)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-07-31
Filing date: 2017-07-31
Publication date: 2019-02-21
Anticipated expiration: 2037-07-31
Also published as: JP6599408B2

Abstract

To provide an acoustic signal processing apparatus and the like that can detect at least one of a speech section and a non-speech section more accurately than before, even when the speech to be recognized and the noise have similar characteristics. SOLUTION: An acoustic signal processing apparatus treats a section of a time-series acoustic signal that is estimated, on the basis of a detection time, to correspond to a specific sound as a specific-sound speech section, and determines that a section of the time-series acoustic signal that is estimated, on the basis of the detection time, not to correspond to the specific sound is a non-speech section. The apparatus calculates a speech section feature amount from the time-series acoustic signal corresponding to the specific-sound speech section and a non-speech section feature amount from the time-series acoustic signal corresponding to the non-speech section, obtains a speech parameter indicating a feature of the speech section from the speech section feature amount and a non-speech parameter indicating a feature of the non-speech section from the non-speech section feature amount, and detects at least one of the speech section and the non-speech section from the time-series acoustic signal using the speech parameter and the non-speech parameter. SELECTED DRAWING: Figure 13

Description

The present invention relates to an acoustic signal processing technique.

In applications such as speech recognition, it is important to detect in which sections speech is being uttered. In a noisy environment, however, the difference between the signal during utterance and the signal during non-utterance becomes small, and it is difficult to detect speech sections simply from the sound volume. Patent Document 1, for example, describes a method for detecting speech sections. In Patent Document 1, a speech model is used to judge how speech-like or non-speech-like an acoustic signal is, thereby improving the accuracy of section detection.

JP 2009-63700 A

However, when, for example, TV sound or music is present as background noise and the speech to be recognized and the noise have similar characteristics, the method of Patent Document 1 has difficulty detecting speech sections.

An object of the present invention is to provide an acoustic signal processing apparatus, method, and program that can detect at least one of a speech section and a non-speech section with higher accuracy than before, even when the speech to be recognized and the noise have similar characteristics.

In order to solve the above problem, according to one aspect of the present invention, an acoustic signal processing apparatus includes: a specific-sound speech section calculation unit that receives as input a time-series acoustic signal including a specific sound, which is a predetermined sound uttered by a person, and information indicating the detection time of the specific sound included in the time-series acoustic signal, treats a section of the time-series acoustic signal that is estimated, on the basis of the detection time, to correspond to the specific sound as a specific-sound speech section, and determines that a section of the time-series acoustic signal that is estimated, on the basis of the detection time, not to correspond to the specific sound is a non-speech section; a feature amount calculation unit that calculates a speech section feature amount, which is a feature amount of the time-series acoustic signal corresponding to the specific-sound speech section, and calculates a non-speech section feature amount, which is a feature amount of the time-series acoustic signal corresponding to the non-speech section; and a speech section detection unit that obtains a speech parameter indicating a feature of the speech section from the speech section feature amount, obtains a non-speech parameter indicating a feature of the non-speech section from the non-speech section feature amount, and detects at least one of the speech section and the non-speech section from the time-series acoustic signal using the speech parameter and the non-speech parameter.

In order to solve the above problem, according to another aspect of the present invention, an acoustic signal processing method includes: a specific-sound speech section calculation step in which an acoustic signal processing apparatus, using a time-series acoustic signal including a specific sound, which is a predetermined sound uttered by a person, and information indicating the detection time of the specific sound included in the time-series acoustic signal, treats a section of the time-series acoustic signal that is estimated, on the basis of the detection time, to correspond to the specific sound as a specific-sound speech section, and determines that a section of the time-series acoustic signal that is estimated, on the basis of the detection time, not to correspond to the specific sound is a non-speech section; a feature amount calculation step of calculating a speech section feature amount, which is a feature amount of the time-series acoustic signal corresponding to the specific-sound speech section, and calculating a non-speech section feature amount, which is a feature amount of the time-series acoustic signal corresponding to the non-speech section; and a speech section detection step of obtaining a speech parameter indicating a feature of the speech section from the speech section feature amount, obtaining a non-speech parameter indicating a feature of the non-speech section from the non-speech section feature amount, and detecting at least one of the speech section and the non-speech section from the time-series acoustic signal using the speech parameter and the non-speech parameter.

According to the present invention, at least one of a speech section and a non-speech section can be detected with higher accuracy than before, even when the speech to be recognized and the noise have similar characteristics.

FIG. 1: Block diagram for explaining an example of the acoustic signal processing apparatus of the first embodiment.
FIG. 2: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 1 of the first embodiment.
FIG. 3: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 2 of the first embodiment.
FIG. 4: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 3 of the first embodiment.
FIG. 5: Flowchart for explaining an example of the acoustic signal processing method.
FIG. 6: Block diagram for explaining an example of the acoustic signal processing apparatus of the second embodiment.
FIG. 7: Block diagram for explaining an example of the direction estimation unit 22 of the second embodiment.
FIG. 8: Block diagram for explaining an example of the direction estimation unit 22 of the second embodiment.
FIG. 9: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 2 of the second embodiment.
FIG. 10: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 3 of the second embodiment.
FIG. 11: Flowchart for explaining an example of the acoustic signal processing method.
FIG. 12: Block diagram for explaining an example of a directional sound collection apparatus of the background art.
FIG. 13: Functional block diagram of the acoustic signal processing apparatus according to the third embodiment.
FIG. 14: Diagram showing an example of the processing flow of the acoustic signal processing apparatus according to the third embodiment.
FIG. 15: Functional block diagram of the speech section detection information storage unit according to the third embodiment.
FIG. 16: Diagram for explaining the specific-sound speech section and the non-speech section.
FIG. 17: Functional block diagram of the speech section detection unit according to the third embodiment.
FIG. 18: Functional block diagram of the first acoustic signal analysis unit according to the third embodiment.
FIG. 19: Functional block diagram of the probability estimation unit according to the third embodiment.
FIG. 20: Functional block diagram of the speech section detection unit according to the first modification of the third embodiment.
FIG. 21: Functional block diagram of the acoustic signal processing apparatus according to the third and fourth modifications of the third embodiment.
FIG. 22: Diagram showing an example of the processing flow of the acoustic signal processing apparatus according to the third and fourth modifications of the third embodiment.
FIG. 23: Block diagram for explaining an example of the acoustic signal processing apparatus of Modification 4 of the first embodiment.
FIG. 24: Flowchart for explaining an example of the acoustic signal processing method of Modification 4 of the first embodiment.

In the drawings used in the following description, components having the same function and steps performing the same processing are denoted by the same reference numerals, and redundant description is omitted. In the following description, symbols such as "^" used in the text should properly be written directly above the character that follows them, but owing to the limitations of text notation they are written immediately before that character. In the formulas, these symbols are written in their proper positions. Unless otherwise specified, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix.

[Technical background]
The acoustic signal processing apparatus is assumed to be given information about a specific sound, which is a predetermined sound, and performs acoustic signal processing using that information. Using the information about the specific sound given in advance increases the information available, so more accurate acoustic signal processing can be performed.

Examples of acoustic signal processing are estimation of the direction of arrival of sound, directional sound collection, extraction of target speech, detection of speech sections, and speech recognition.

For example, by detecting a keyword, which is a specific sound, in a particular utterance of the user, the signal section of the target speech and the signal section of the noise can be grasped accurately and used in subsequent processing.

When this property is used for speech section detection, the signals of the noise section and of the speech section are each known, so the parameters for the speech/non-speech decision can be updated to values that better match the actual measurements.
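The parameter update described above can be pictured with a small sketch. It assumes, purely for illustration, that the feature is a per-frame log energy and that each class is modeled by a single Gaussian; neither choice is taken from this document, and all names and values (`fit_gaussian`, `classify_frames`, the sample features) are hypothetical.

```python
import math

def fit_gaussian(values):
    """Return (mean, variance) of the values; the variance is floored to stay positive."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, max(var, 1e-6)

def log_likelihood(x, params):
    """Log density of x under a one-dimensional Gaussian (mean, variance)."""
    mean, var = params
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify_frames(features, speech_params, noise_params):
    """Label a frame True (speech) when the speech model is the more likely one."""
    return [log_likelihood(x, speech_params) > log_likelihood(x, noise_params)
            for x in features]

# Hypothetical log-energy features. Frames 3..5 are assumed to lie inside the
# detected keyword; frames 0..2 (just before it) are assumed to be noise only.
features = [1.0, 1.2, 0.9, 3.1, 3.4, 2.9, 1.1, 3.2]
speech = fit_gaussian(features[3:6])   # parameters from the specific-sound section
noise = fit_gaussian(features[0:3])    # parameters from the non-speech section
labels = classify_frames(features[6:], speech, noise)  # later, unlabeled frames
```

Because both models are fitted to signals observed in the actual environment, the decision boundary follows the measured noise level instead of a fixed threshold.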

When the direction of speech is estimated as the acoustic signal processing, the direction in which the specific sound was detected is regarded as the direction of the speech, so that direction estimation operates robustly even if sound containing speech arrives from a direction other than the intended one.

When target speech extraction is performed as the acoustic signal processing, the signals of the speech section and the non-speech section are obtained accurately, so the spatial correlation matrix used to compute the steering vector for speech separation can be obtained more accurately.

When speech recognition is performed as the acoustic signal processing, the noise level can be obtained more accurately, so accuracy can be improved through the selection of the acoustic model.

Each embodiment is described below with reference to the drawings.

[First embodiment]
The acoustic signal processing apparatus and method of the first embodiment perform directional sound collection as the acoustic signal processing.

As shown in FIG. 1, the acoustic signal processing apparatus includes, for example, a direction estimation unit 11, a specific sound detection unit 12, a direction storage unit 13, and a first directional sound collection unit 14. The acoustic signal processing apparatus need not include the specific sound detection unit 12.

The acoustic signal processing method is realized, for example, by the acoustic signal processing apparatus performing the processing of steps S11 to S14 shown in FIG. 5 and described below.

The direction estimation unit 11 estimates the direction of arrival of sound from signals collected by a plurality of microphones (step S11). The direction estimation unit 11 estimates the direction of arrival of sound at each time. The estimated direction of arrival at each time is output to the direction storage unit 13.

The direction estimation unit 11 may use any direction estimation method. For example, the direction estimation unit 11 estimates the direction of arrival of sound using the direction estimation techniques described in Patent Documents 1 and 2. The direction of arrival of sound may be represented by a position rather than a direction.
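Since the estimation method is left open, the following is only one possible sketch of step S11, not the method of the cited documents. It assumes two microphones, a far-field source, and a plain time-domain cross-correlation to find the inter-microphone lag; the function names, the sampling rate, and the microphone spacing are hypothetical.

```python
import math

def best_lag(x, y, max_lag):
    """Return the lag d (in samples) that maximizes sum_n x[n] * y[n + d]."""
    def corr(d):
        return sum(x[n] * y[n + d] for n in range(len(x)) if 0 <= n + d < len(y))
    return max(range(-max_lag, max_lag + 1), key=corr)

def arrival_angle(lag, fs, mic_distance, c=343.0):
    """Convert a lag in samples to an angle in degrees measured from broadside."""
    s = lag * c / (fs * mic_distance)
    return math.degrees(math.asin(max(-1.0, min(1.0, s))))  # clamp against rounding

# Toy signals: microphone 1 receives the same pulse two samples later.
mic0 = [0.0] * 32
mic1 = [0.0] * 32
mic0[10], mic0[11] = 1.0, 0.5
mic1[12], mic1[13] = 1.0, 0.5

lag = best_lag(mic0, mic1, max_lag=4)
angle = arrival_angle(lag, fs=16000, mic_distance=0.1)
```

Running such an estimator on every frame yields the per-time directions that are passed on to the direction storage unit 13.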

The specific sound detection unit 12 detects a specific sound, which is a predetermined sound (step S12). Examples of the predetermined sound are the speech of a specific keyword, a whistle, and a hand clap. A predetermined sound other than these examples may also be used.

The direction storage unit 13 stores the direction of arrival estimated by the direction estimation unit 11 at the time when the specific sound was detected by the specific sound detection unit 12. More specifically, of the directions of arrival at each time input from the direction estimation unit 11, the direction storage unit 13 stores the direction of arrival at the time when the specific sound was detected by the specific sound detection unit 12.

The first directional sound collection unit 14 collects sound so that sound from the direction of arrival read from the direction storage unit 13 is emphasized (step S14). The first directional sound collection unit 14 may use any directional sound collection method. For example, it performs the directional sound collection described in JP 2009-44588 A.
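As one illustration of emphasizing a stored direction, the sketch below uses a basic delay-and-sum beamformer with integer sample delays. This is a stand-in chosen for brevity, not the method of JP 2009-44588 A; the function name and the toy signals are hypothetical, and the per-channel delays are assumed to have already been derived from the stored direction.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing channel m by delays[m] >= 0 samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
            for n in range(length)]

def energy(signal):
    return sum(v * v for v in signal)

# Toy two-microphone recording: the target reaches microphone 1 two samples late.
target = [0, 1, 0, -1, 0, 1, 0, -1]
mic0 = target + [0, 0]
mic1 = [0, 0] + target

steered = delay_and_sum([mic0, mic1], [0, 2])    # delays matching the stored direction
unsteered = delay_and_sum([mic0, mic1], [0, 0])  # no steering, channels misaligned
```

When the delays match the stored direction the two channels add coherently, which is why the steered output retains more of the target energy than the unsteered one.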

In this way, the sound source that emitted the specific sound is identified as the sound source to be collected, and directional sound collection toward that source allows sound to be collected with a high SN ratio. By emitting a specific sound such as a specific keyword, the user can change the direction of the directivity; even when another sound source such as a television is present, the user can point the directivity toward himself or herself and then keep it fixed.

If detection of the specific sound by the specific sound detection unit 12 takes time, a delay unit 15 that delays by a corresponding time may be placed after the direction estimation unit 11. In FIG. 1, the delay unit 15 is indicated by a broken line. The delay unit 15 delays the output of the direction estimation unit 11 by a time corresponding to the time required for the specific sound detection unit 12 to detect the specific sound, and then inputs it to the direction storage unit 13. The apparatus thus operates correctly even when detection of the specific sound is delayed.
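The interaction between the delay unit 15 and the direction storage unit 13 can be sketched as follows. The class name, the frame-based timing, and the use of a fixed detector latency are assumptions made for illustration only.

```python
from collections import deque

class DirectionStore:
    """Keep the direction that was estimated when the specific sound occurred."""

    def __init__(self, detection_delay_frames):
        # Delay unit: buffer the direction estimates by the detector's latency.
        self._pending = deque([None] * detection_delay_frames)
        self.stored_direction = None

    def update(self, estimated_direction, detected):
        """Feed one frame; `detected` is the (late) detector output for this frame."""
        self._pending.append(estimated_direction)
        delayed = self._pending.popleft()
        if detected and delayed is not None:
            self.stored_direction = delayed
        return self.stored_direction

# Hypothetical scenario: the keyword is spoken from 30 degrees at frame 0, the
# detector reports it two frames later, when another source at -60 degrees dominates.
directions = [30, 30, -60, -60, -60]
detections = [False, False, True, False, False]

with_delay = DirectionStore(detection_delay_frames=2)
without_delay = DirectionStore(detection_delay_frames=0)
for theta, hit in zip(directions, detections):
    with_delay.update(theta, hit)
    without_delay.update(theta, hit)
```

In this scenario the delayed store latches the direction of the person who spoke the keyword, whereas the store without the delay latches the direction that happened to be estimated when the late detection result arrived.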

[[Modification 1 of the first embodiment]]
As illustrated in FIG. 2, the acoustic signal processing apparatus may further include an estimation frequency measurement unit 16 and a selection unit 17. In this case, the direction estimation unit 11 may be capable of estimating a plurality of directions simultaneously. That is, when sound from a noise source is present at the same time as the specific sound, the direction estimation unit 11 may be able to estimate the directions of both sound sources. In that case it can no longer be determined which sound source emitted the specific sound, so the estimation frequency measurement unit 16 makes this determination on the basis of how often each direction was estimated in the past. A sound source such as a TV outputs sound continuously, so its direction is considered to have been estimated many times in the past; the estimation frequency measurement unit 16 uses this as a clue for the determination.

The estimation frequency measurement unit 16 measures the frequency of each direction of arrival estimated by the direction estimation unit 11 in a predetermined past time interval (step S16). That is, the estimation frequency measurement unit 16 measures how often each direction was estimated within a fixed past period. Information on the measured frequencies is output to the selection unit 17.

For example, if the output of the direction estimation unit 11 was the direction θ for A(θ) seconds during the past T seconds, the estimation frequency of the direction θ is given by the ratio D(θ) = A(θ)/T. The estimation frequency measurement unit 16 obtains this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a sound source lies in the direction θ, the estimation frequency D(θ) takes a large value close to 1.

The selection unit 17 selects the direction of arrival with the lowest frequency among the frequencies measured by the estimation frequency measurement unit 16. For example, when the direction estimation unit 11 outputs two estimated directions, the selection unit 17 selects the one with the smaller estimation frequency D(θ). The direction of arrival selected by the selection unit 17 at the time when the specific sound was detected by the specific sound detection unit 12 is stored in the direction storage unit 13.
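The ratio D(θ) = A(θ)/T and the selection rule can be sketched as below. The sketch assumes that directions are quantized to a fixed grid and that time is counted in frames rather than seconds; the function names and the example history are hypothetical.

```python
from collections import Counter

def estimation_frequency(history, window):
    """Return D(theta) = A(theta) / T over the last `window` frames of history.

    Each history entry is the set of directions estimated in one frame.
    """
    recent = history[-window:]
    counts = Counter(theta for frame in recent for theta in frame)
    return {theta: n / len(recent) for theta, n in counts.items()}

def select_direction(candidates, frequency):
    """Pick the candidate whose past estimation frequency is lowest."""
    return min(candidates, key=lambda theta: frequency.get(theta, 0.0))

# Hypothetical history: a TV at 90 degrees is estimated in every frame, while
# the user at -30 degrees appears only in the latest frame.
history = [{90}] * 9 + [{90, -30}]
frequency = estimation_frequency(history, window=10)
chosen = select_direction([90, -30], frequency)
```

Here D(90) is 1.0 and D(-30) is 0.1, so the selection falls on the direction of the user who has just spoken rather than on the continuously active TV.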

Thereafter, in the same manner as described above, the first directional sound collection unit 14 collects sound so that sound from the direction of arrival read from the direction storage unit 13 is emphasized.

In Modification 1 of the first embodiment as well, if detection of the specific sound by the specific sound detection unit 12 takes time, a delay unit 15 that delays by a corresponding time may be placed after the direction estimation unit 11. In FIG. 2, the delay unit 15 is indicated by a broken line. The apparatus thus operates correctly even when detection of the specific sound is delayed.

[[Modification 2 of the first embodiment]]
As illustrated in FIG. 3, the acoustic signal processing apparatus may further include a second directional sound collection unit 18.

Performing directional sound collection with the second directional sound collection unit 18 before the processing of the specific sound detection unit 12 allows the specific sound to be detected with higher accuracy.

A signal obtained by delaying the signals collected by the plurality of microphones is input to the second directional sound collection unit 18. The length of this delay corresponds to the time required for the direction-of-arrival estimation by the direction estimation unit 11. The delay is applied by a delay unit 19, indicated by a broken line in FIG. 3. The direction of arrival estimated by the direction estimation unit 11 is also input to the second directional sound collection unit 18.

The second directional sound collection unit 18 collects sound so that sound from the direction of arrival estimated by the direction estimation unit 11 is emphasized (step S18). More specifically, the second directional sound collection unit 18 uses the delayed version of the signals collected by the plurality of microphones to collect sound so that sound from the direction of arrival estimated by the direction estimation unit 11 is emphasized. The signal collected by the second directional sound collection unit 18 is output to the specific sound detection unit 12.

The specific sound detection unit 12 detects the specific sound on the basis of the signal collected by the second directional sound collection unit 18. The subsequent processing is the same as described above.

As shown in FIG. 3, the acoustic signal processing apparatus may include a plurality of second directional sound collection units 18. In this case, the apparatus includes the same number of specific sound detection units 12 as second directional sound collection units 18.

In this case, when the direction estimation unit 11 estimates a plurality of directions of arrival, the second directional sound collection units 18 operate so as to emphasize the respective estimated directions of arrival, and their outputs are input to the respective specific sound detection units 12, which perform detection of the specific sound.

This makes it possible to assign priorities when the specific sound is detected by more than one of the specific sound detection units 12.

In Modification 2 of the first embodiment as well, if detection of the specific sound by the specific sound detection unit 12 takes time, a delay unit 15 that delays by a corresponding time may be placed after the direction estimation unit 11. In FIG. 3, the delay unit 15 is indicated by a broken line. The apparatus thus operates correctly even when detection of the specific sound is delayed.

[[Modification 3 of the first embodiment]]
As illustrated in FIG. 4, the acoustic signal processing apparatus of Modification 2 of the first embodiment may further include the estimation frequency measurement unit 16 and the selection unit 17 described in Modification 1 of the first embodiment. In this case, the direction estimation unit 11 may be capable of estimating a plurality of directions simultaneously. That is, when sound from a noise source is present at the same time as the specific sound, the direction estimation unit 11 may be able to estimate the directions of both sound sources.

The processing of the estimation frequency measurement unit 16 and the selection unit 17 is the same as that described in Modification 1 of the first embodiment.

That is, the estimation frequency measurement unit 16 measures the frequency of each direction of arrival estimated by the direction estimation unit 11 in a predetermined past time interval (step S16). In other words, it measures how often each direction was estimated within a fixed past period. Information on the measured frequencies is output to the selection unit 17.

In Modification 3 of the first embodiment as well, if detection of the specific sound by the specific sound detection unit 12 takes time, a delay unit 15 that delays by a corresponding time may be placed after the direction estimation unit 11. In FIG. 4, the delay unit 15 is indicated by a broken line. The apparatus thus operates correctly even when detection of the specific sound is delayed.

[[Modification 4 of the first embodiment]]
As illustrated in FIG. 23, the acoustic signal processing apparatus may include a third directional sound collection unit 52 in place of the first directional sound collection unit 14, and may further include a noise direction storage unit 51.

The acoustic signal processing method is realized, for example, by the acoustic signal processing apparatus performing the processing of step S52 shown in FIG. 24 and described below.

The noise direction storage unit 51 stores the directions of arrival estimated by the direction estimation unit 11 at times other than the time when the specific sound was detected by the specific sound detection unit 12. Here, a time other than the time when the specific sound was detected may be a time earlier than the detection time, a time later than it, or both. Needless to say, the delay unit 15 may be placed before the noise direction storage unit 51 and after the direction estimation unit 11.

The third directional sound collection unit 52 collects sound so that sound from the direction of arrival read from the direction storage unit 13 is emphasized and sound from the direction of arrival read from the noise direction storage unit 51 is suppressed (step S52). The third directional sound collection unit 52 may use any directional sound collection method; for example, the method described in Reference 5 may be used.
(Reference 5) Futoshi Asano (浅野太), "Array Signal Processing of Sound" (音のアレイ信号処理), pp. 82-85, Corona Publishing, 2011.

[Second Embodiment]
The acoustic signal processing device and method of the first embodiment perform directional sound collection as the acoustic signal processing.

As shown in FIG. 6, the acoustic signal processing device includes, for example, a specific sound detection unit 21, a direction estimation unit 22, and a first directional sound collecting unit 23. The acoustic signal processing device need not include the specific sound detection unit 12.

The acoustic signal processing method is realized, for example, by the acoustic signal processing device performing the processing of steps S21 to S23 shown in FIG. 11 and described below.

The specific sound detection unit 21 detects a specific sound, which is a predetermined sound (step S21). Examples of the predetermined sound are the voice of a specific keyword, a whistle, and a hand clap. A predetermined sound other than these examples may also be used.

The direction estimation unit 22 estimates the arrival direction of sound from the signals collected by a plurality of microphones (step S22). In doing so, the direction estimation unit 22 estimates the arrival direction in such a way that a direction closer to the arrival direction estimated at the time the specific sound detection unit 21 detected the specific sound is more likely to be estimated as the arrival direction.

That is, the direction estimation unit 22 sets how easily each direction is detected according to the result of the specific sound detection. In other words, the closer a direction is to the direction estimated when the specific sound was detected, the more easily it is detected, and the farther away it is, the less easily it is detected. As a result, the directivity tends to point toward the user who emitted the specific sound and tends not to point toward a noise source. In addition, even if the user who emitted the specific sound moves, the directivity can follow the user.

FIG. 7 shows an example of the configuration of the direction estimation unit 22. As illustrated in FIG. 7, the direction estimation unit 22 includes a direction enhancement unit 221, a power calculation unit 222, a weight multiplication unit 223, a maximum power direction detection unit 224, and a weight determination unit 225.

Each of the signals collected by the plurality of microphones is input to the direction enhancement unit 221.

The direction enhancement unit 221 performs direction enhancement processing on the signals collected by the plurality of microphones so as to enhance each of a plurality of directions (step S221). For example, when N direction enhancement units 221 are provided, with θ1, θ2, ..., θN being mutually different directions, the N direction enhancement units 221 perform direction enhancement processing so as to enhance the directions θ1, θ2, ..., θN, respectively. The enhanced signals are output to the power calculation unit 222.

The power calculation unit 222 calculates the power of each signal enhanced by the direction enhancement unit 221 (step S222). The calculated power is output to the weight multiplication unit 223.

The weight multiplication unit 223 multiplies the power calculated by the power calculation unit 222 by the weight set by the weight setting unit 225 (step S223). The weighted power is output to the maximum power direction detection unit 224. As described later, the weight multiplication unit 223 thus obtains the weighted power by multiplying the power of the signal in which each arrival direction is enhanced by a weight that is larger the closer that arrival direction is to the selected arrival direction.

The maximum power direction detection unit 224 selects, from the outputs of the weight multiplication unit 223, the arrival direction with the maximum power. In other words, the maximum power direction detection unit 224 selects the arrival direction with the largest weighted power and takes the selected arrival direction as the estimated arrival direction (step S224). The estimated arrival direction is output, as the direction estimation result, to the weight determination unit 225 and the first directional sound collecting unit 23.

The weight setting unit 225 determines the weights corresponding to the direction estimation result output by the maximum power direction detection unit 224 at the time the specific sound detection unit 21 detected the specific sound. The determined weights are output to the weight multiplication unit 223. In other words, the weight setting unit 225 sets the weights corresponding to the direction estimation result when the specific sound is detected.

The weights corresponding to the direction estimation result are set so that the weight for the estimated arrival direction is largest and the weight decreases with distance from that arrival direction. For example, the weight for the estimated arrival direction is set to 1.0, and the weight is multiplied by a multiplier less than 1.0 (for example, 0.8) for every 10 degrees of deviation from the estimated arrival direction.
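The weighting rule and the weighted maximum-power selection (steps S223 and S224) can be sketched as follows. This is only an illustrative sketch: the function names, the wrap-around angular distance on a 360-degree circle, and the continuous exponent (which matches the "every 10 degrees" rule exactly on a 10-degree grid) are assumptions that the text does not specify. The multiplier 0.8 is the example value given above.

```python
def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two angles, assuming a 360-degree circle."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def direction_weights(directions_deg, estimated_deg, multiplier=0.8, step_deg=10.0):
    """Weight 1.0 at the estimated direction, multiplied by `multiplier`
    for every `step_deg` degrees of deviation (hypothetical helper)."""
    return [multiplier ** (angular_distance(d, estimated_deg) / step_deg)
            for d in directions_deg]


def estimate_direction(directions_deg, powers, weights):
    """Return the direction whose weighted power is largest."""
    weighted = [p * w for p, w in zip(powers, weights)]
    best = max(range(len(weighted)), key=weighted.__getitem__)
    return directions_deg[best]
```

With candidate directions 0, 10, 20, and 350 degrees and a specific sound previously localized at 0 degrees, a slightly louder source at 20 degrees no longer wins, because its power is scaled by 0.64.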

The first directional sound collecting unit 23 collects sound so that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized (step S23). Any method of directional sound collection may be used by the first directional sound collecting unit 23. The first directional sound collecting unit 23 performs, for example, the directional sound collection described in Japanese Patent Application Laid-Open No. 2009-44588.

If the specific sound detection unit 21 takes time to detect the specific sound, a delay unit 226 that applies a delay corresponding to that time may be inserted after the maximum power direction detection unit 224. In FIG. 7, the delay unit 226 is indicated by a broken line. The delay unit 226 delays the output of the maximum power direction detection unit 224 by a time corresponding to the detection time of the specific sound detection unit 21 and then inputs it to the weight setting unit 225. This allows normal operation even when the detection of the specific sound is delayed.
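The role of the delay unit 226 can be illustrated with a fixed-length FIFO. The sketch below assumes that the detection latency is known as a whole number of frames; the class name and the fill value are hypothetical and are not taken from the source.

```python
from collections import deque


class DelayUnit:
    """Delays a per-frame stream of direction estimates by a fixed number of frames."""

    def __init__(self, delay_frames, fill=None):
        # Pre-fill so that the first `delay_frames` outputs are the fill value.
        self._buf = deque([fill] * delay_frames)

    def push(self, value):
        """Store the newest estimate and return the one from `delay_frames` frames ago."""
        self._buf.append(value)
        return self._buf.popleft()
```

When the detector reports a specific sound with a latency of two frames, the value returned by `push` at that moment is the direction that was estimated while the sound was actually being emitted.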

[[Modification 1 of the second embodiment]]
As illustrated in FIG. 8, the acoustic signal processing device may further include an estimation frequency measurement unit 227 and a selection unit 228.

In this case, the maximum power direction detection unit 224 may be capable of estimating a plurality of directions simultaneously by detecting all directions whose power exceeds a predetermined threshold. That is, the maximum power direction detection unit 224 detects the direction of maximum power and then, excluding the directions already detected, detects the direction of maximum power again. It ends the maximum power detection when a preset maximum number of estimated directions is reached or when the maximum power falls to or below a preset threshold. The maximum power direction detection unit 224 may be capable of simultaneously estimating the directions of a plurality of sound sources by, for example, such a method. Thus, when a noise source emits sound at the same time as the specific sound, the maximum power direction detection unit 224 can estimate the directions of both sound sources.
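A minimal sketch of this repeated maximum-power detection, under the assumption that one power value is available per candidate direction. The function and parameter names are illustrative. The text only says that already detected directions are excluded; a practical implementation might also exclude their neighbors, which is not done here.

```python
def detect_directions(directions_deg, powers, max_directions, threshold):
    """Pick directions in decreasing order of power until the preset maximum
    count is reached or the largest remaining power is at or below the threshold."""
    remaining = list(range(len(directions_deg)))
    detected = []
    while remaining and len(detected) < max_directions:
        best = max(remaining, key=lambda i: powers[i])
        if powers[best] <= threshold:
            break  # the maximum power is at or below the threshold
        detected.append(directions_deg[best])
        remaining.remove(best)  # exclude the direction just detected
    return detected
```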

In this case, it is no longer possible to tell which sound source emitted the specific sound, so the estimation frequency measurement unit 227 makes this determination based on how often each direction was estimated in the past. That is, because a sound source such as a TV outputs sound constantly, its direction is considered to have been estimated many times in the past, and the estimation frequency measurement unit 227 uses this as a clue for the determination.

The estimation frequency measurement unit 227 measures, for a predetermined past time interval, the frequency of each arrival direction estimated by the direction estimation unit 22, in other words, the frequency of each arrival direction selected by the maximum power direction detection unit 224 (step S16). That is, the estimation frequency measurement unit 227 measures how frequently each direction was estimated within a fixed past period. Information on the measured frequencies is output to the selection unit 228.

For example, if A(θ) seconds is the time during which the output of the maximum power direction detection unit 224 was the direction θ in the past T seconds, the estimation frequency of the direction θ is given by their ratio D(θ) = A(θ)/T. The estimation frequency measurement unit 227 obtains this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a sound source is in the direction θ, the estimation frequency D(θ) takes a large value close to 1.

The selection unit 228 selects the arrival direction with the lowest frequency among the frequencies measured by the estimation frequency measurement unit 227. For example, when the maximum power direction detection unit 224 outputs two estimated directions, the selection unit 228 selects the one with the smaller estimation frequency D(θ). The selected arrival direction is output to the weight setting unit 225.
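The frequency measurement and selection can be sketched as follows, assuming that the past T seconds are represented as a list of per-frame sets of estimated directions. This representation and the function names are illustrative assumptions, not part of the source. With frames of equal length, A(θ)/T reduces to the fraction of frames in which θ was estimated.

```python
def estimation_frequency(history, direction):
    """D(theta) = A(theta) / T, computed as the fraction of past frames
    in which `direction` was among the estimated directions."""
    if not history:
        return 0.0
    hits = sum(1 for frame_dirs in history if direction in frame_dirs)
    return hits / len(history)


def select_least_frequent(history, candidates):
    """Choose the candidate direction that was estimated least often in the past."""
    return min(candidates, key=lambda d: estimation_frequency(history, d))
```

A television at 90 degrees that has been active in every past frame yields D(90) close to 1, so a keyword spoken from 270 degrees is selected even when both directions are detected at the same time.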

If the specific sound detection unit 21 takes time to detect the specific sound, a delay unit 226 that applies a delay corresponding to that time may be inserted after the maximum power direction detection unit 224. In FIG. 8, the delay unit 226 is indicated by a broken line. The delay unit 226 delays the output of the maximum power direction detection unit 224 by a time corresponding to the detection time of the specific sound detection unit 21 and then inputs it to the weight setting unit 225. This allows normal operation even when the detection of the specific sound is delayed.

[[Modification 2 of the second embodiment]]
As illustrated in FIG. 9, the acoustic signal processing device may further include a second directional sound collecting unit 24.

Performing directional sound collection with the second directional sound collecting unit 24 before the processing of the specific sound detection unit 21 allows the specific sound to be detected with higher accuracy.

The second directional sound collecting unit 24 receives a delayed version of the signals collected by the plurality of microphones. The length of this delay corresponds to the time required for the arrival direction estimation processing of the direction estimation unit 22. The delay is applied by the delay unit 25 indicated by a broken line in FIG. 9. The arrival direction estimated by the direction estimation unit 22 is also input to the second directional sound collecting unit 24.

The second directional sound collecting unit 24 collects sound so that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized (step S24). More specifically, the second directional sound collecting unit 24 uses the delayed version of the signals collected by the plurality of microphones to collect sound so that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized. The signal collected by the second directional sound collecting unit 24 is output to the specific sound detection unit 21.

The specific sound detection unit 21 detects the specific sound based on the signal collected by the second directional sound collecting unit 24. The subsequent processing is the same as described above.

As shown in FIG. 9, the acoustic signal processing device may include a plurality of second directional sound collecting units 24. In this case, the acoustic signal processing device includes the same number of specific sound detection units 21 as second directional sound collecting units 24.

In this case, when the direction estimation unit 22 estimates a plurality of arrival directions, the plurality of second directional sound collecting units 24 operate so as to emphasize the respective estimated arrival directions, and their outputs are input to the respective specific sound detection units 21, where the specific sound is detected.

This makes it possible to assign priorities when the specific sound is detected by more than one of the specific sound detection units 21.

[[Modification 3 of the second embodiment]]
As illustrated in FIG. 10, the acoustic signal processing device of Modification 2 of the second embodiment may further include an estimation frequency measurement unit 26 and a selection unit 27. In this case, the direction estimation unit 22 may be capable of estimating a plurality of directions simultaneously. That is, when a noise source emits sound at the same time as the specific sound, the direction estimation unit 22 may be able to estimate the directions of both sound sources.

The processing of the estimation frequency measurement unit 26 and the selection unit 27 is the same as that described in Modification 1 of the first embodiment.

That is, the estimation frequency measurement unit 26 measures, for a predetermined past time interval, the frequency of each arrival direction estimated by the direction estimation unit 22 (step S26). In other words, the estimation frequency measurement unit 26 measures how frequently each direction was estimated within a fixed past period. Information on the measured frequencies is output to the selection unit 27.

For example, if A(θ) seconds is the time during which the output of the direction estimation unit 22 was the direction θ in the past T seconds, the estimation frequency of the direction θ is given by their ratio D(θ) = A(θ)/T. The estimation frequency measurement unit 26 obtains this frequency for every direction. If the noise source is assumed to be a television or a loudspeaker for listening to music, sound is emitted from the same direction for a long time with almost no silence. When such a sound source is in the direction θ, the estimation frequency D(θ) takes a large value close to 1.

The selection unit 27 selects the arrival direction with the lowest frequency among the frequencies measured by the estimation frequency measurement unit 26 (step S27). For example, when the direction estimation unit 22 outputs two estimated directions, the selection unit 27 selects the one with the smaller estimation frequency D(θ). The arrival direction selected by the selection unit 27 at the time the specific sound detection unit 21 detected the specific sound is output to the direction estimation unit 22 and is treated as the arrival direction estimated by the direction estimation unit 22.

Thereafter, in the same manner as described above, the first directional sound collecting unit 23 collects sound so that sound from the arrival direction estimated by the direction estimation unit 22 is emphasized.

[Third Embodiment]
The acoustic signal processing device and method of the third embodiment detect voice sections as the acoustic signal processing.

<Points of third embodiment>
In this embodiment, information on the usage environment (noise and the like) is obtained more accurately by restricting what the user utters. For example, the user is required to utter a specific word (keyword) before starting to speak. The device is arranged so that only the voice of that specific word can be detected with high accuracy, and it is assumed that "that section is voice" and "the section preceding it is noise". The signals of the noise section and the voice section are then used to update the information used for the "voice / non-voice" decision.

As a result, when the section of the target speech uttered afterward is determined, information on "noise" and "voice" that better matches the actual usage environment can be used, and the accuracy of section detection improves.

Embodiments of the acoustic signal processing device and method are described below. The acoustic signal processing device is realized by a computer, for example a dedicated machine built from dedicated hardware or a general-purpose machine such as a personal computer. Here, the case where it is realized by a computer (general-purpose machine) is described.

An example of the hardware configuration of the acoustic signal processing device is described.

The acoustic signal processing device has: an input unit to which a keyboard, a pointing device, and the like can be connected; an output unit to which a liquid crystal display, a CRT (Cathode Ray Tube) display, and the like can be connected; a communication unit to which a communication device capable of communicating with the outside of the acoustic signal processing device (for example, a communication cable, a LAN card, a router, or a modem) can be connected; a CPU (Central Processing Unit) [which may be a DSP (Digital Signal Processor), and which may include a cache memory, registers, and the like]; a RAM and a ROM, which are memories; an external storage device such as a hard disk, an optical disk, or a semiconductor memory; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the acoustic signal processing device may also be provided with a device (drive) capable of reading from and writing to a storage medium such as a flexible disk, a CD-ROM (Compact Disc Read Only Memory), or a DVD (Digital Versatile Disc).

The acoustic signal processing device may also be provided with a signal input unit and a signal output unit. The signal input unit can be connected to an acoustic signal pickup means (for example, a microphone) that receives sounds such as voice, music, and noise, and it receives the (analog) signal obtained by the microphone. The signal output unit can be connected to a sound output device (for example, a loudspeaker) that outputs a reproduction signal as sound, and it outputs the signal to be input to the loudspeaker (the D/A-converted reproduction signal). In this case, a microphone is connected to the signal input unit and a loudspeaker is connected to the signal output unit.

The external storage device of the acoustic signal processing device stores a program for voice section detection and the data required by the processing of this program [the program need not be stored in the external storage device and may instead be stored, for example, in a ROM, which is a read-only storage device]. Data obtained by the processing of this program are stored as appropriate in the RAM, the external storage device, or the like. Hereinafter, storage means that store data, the addresses of their storage areas, and the like are simply referred to as "XX storage units".

In this embodiment, the acoustic signal, which is a discrete signal, is stored in the main storage unit so that the signal of a section that precedes, in time series, the voice section contained in the acoustic signal can be obtained. This storage may be temporary storage such as a buffer.

<Configuration of acoustic signal processing device>
FIG. 13 is a functional block diagram of the acoustic signal processing device according to the third embodiment, and FIG. 14 shows its processing flow.

The acoustic signal processing device includes a voice section detection unit 320 and a voice section detection information storage unit 330.

The acoustic signal processing device takes as input the time-series acoustic signal picked up by one microphone 310 and the output value of the specific voice section detection unit 340, detects at least one of the voice sections and the non-voice sections contained in the time-series acoustic signal, and outputs the detection result.

The specific voice section detection unit 340 detects that a predetermined sound (hereinafter also referred to as the "specific sound") has occurred and outputs information indicating the detection time of the specific sound. In this embodiment, the specific sound is a predetermined voice uttered by a person, for example, the voice produced when a person utters a predetermined keyword. The specific voice section detection unit 340 can be implemented using, for example, a technique such as the "phrase spotting" of Reference 1.
(Reference 1) 「センサリ社音声技術説明」 ("Sensory's Voice Technology Description"), [online], 2010, [retrieved July 24, 2017], Internet <URL:http://www.sensory.co.jp/Parts/Docs/SensoryTechnologyJP1003B.pdf>
The information indicating the detection time of the specific sound is information indicating at least the time at which utterance of the specific sound (for example, a keyword) ended. It may be (1-i) the end time itself, (1-ii) the frame number of the time-series acoustic signal corresponding to the end time, (1-iii) information that indicates the end time by outputting a value indicating non-detection (for example, "0") at frame times other than the end time and a value indicating detection (for example, "1") at the end time, or any other information indicating the end time. The information indicating the detection time of the specific sound may also include information indicating the time at which utterance of the specific sound started. It may be (2-i) the start time and the end time themselves, (2-ii) the frame numbers of the time-series acoustic signal corresponding to the start time and the end time, (2-iii) information that indicates the end time by outputting a value indicating detection (for example, "1") from the start time to the end time and a value indicating non-detection (for example, "0") at all other times, or any other information indicating the end time.
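As an illustration of form (2-iii), the per-frame flag stream can be converted back into a start frame and an end frame. The sketch below assumes that the flags arrive as a list of 0/1 integers, one per frame, and returns only the first detected run; the function name is hypothetical.

```python
def first_detection_interval(flags):
    """Return (start_frame, end_frame) of the first run of 1s in a per-frame
    detection flag sequence, or None if no frame is flagged."""
    start = None
    for i, flag in enumerate(flags):
        if flag == 1 and start is None:
            start = i                      # utterance of the specific sound starts
        elif flag != 1 and start is not None:
            return (start, i - 1)          # previous frame was the last flagged one
    if start is not None:
        return (start, len(flags) - 1)     # run continues to the end of the input
    return None
```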

The processing performed by each unit is described below.

<Voice section detection information storage unit 330>
The voice section detection information storage unit 330 takes as input the information indicating the detection time of the specific sound and the time-series acoustic signal, obtains, frame by frame, the feature amount of the time-series acoustic signal corresponding to the specific sound voice section and the feature amount of the time-series acoustic signal corresponding to the non-voice section (S330), and outputs them. In every unit, including the voice section detection information storage unit 330, each process is performed frame by frame.

As shown in FIG. 15, the voice section detection information storage unit 330 includes a voice storage unit 331, a specific sound voice section calculation unit 332, and a feature amount calculation unit 333. The processing performed by each unit is described below.

(Voice storage unit 331)
The voice storage unit 331 receives and stores the time-series acoustic signal that is the target of voice section detection.

(Specific sound voice section calculation unit 332)
The specific sound voice section calculation unit 332 takes as input the information indicating the detection time of the specific sound. It determines the section of the time-series acoustic signal that is estimated, based on the detection time, to correspond to the specific sound to be the specific sound voice section, determines the section of the time-series acoustic signal that is estimated, based on the detection time, not to correspond to the specific sound to be the non-voice section, and outputs information indicating the specific sound voice section and information indicating the non-voice section. For example, the t₁ seconds before the detection time of the specific sound (in this example, the time at which utterance of the specific sound ended) are determined to be the specific sound voice section, and the t₂ seconds before the specific sound voice section are determined to be the non-voice section (see FIG. 16).

For example, when the information indicating the detection time of the specific sound contains only the frame time at which the utterance of the specific sound ended (denoted t), t₁ and t₂ are each set to predetermined values in advance, and the specific sound voice section (from t-t₁ to t) and the non-voice section (from t-t₁-t₂ to t-t₁) are obtained from that information. The average time taken to utter the specific sound, for example, may be used as t₁. When the information indicating the detection time contains both the time at which the utterance of the specific sound started and the time at which it ended (denoted t), the start time is written as t-t₁, and the specific sound voice section runs from the start time t-t₁ to the end time t. In this case t₂ is set to a predetermined value in advance, and the non-voice section (from t-t₁-t₂ to t-t₁) is obtained from t₂ and the start time t-t₁.
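The section boundaries described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the apparatus's implementation: the function name `sections_from_detection`, the `Sections` tuple, and the use of seconds as the time unit are all introduced here for illustration.

```python
from typing import NamedTuple, Optional, Tuple


class Sections(NamedTuple):
    """Hypothetical container: each section is a (start, end) pair."""
    voice: Tuple[float, float]
    non_voice: Tuple[float, float]


def sections_from_detection(t_end: float, t2: float,
                            t1: Optional[float] = None,
                            t_start: Optional[float] = None) -> Sections:
    """Derive the specific sound voice section and the non-voice section.

    t_end   : detection time t (end of the utterance of the specific sound)
    t2      : predetermined length of the non-voice section
    t1      : predetermined length of the specific sound voice section,
              used when only the end time is reported
    t_start : start time of the utterance, if reported; t1 then becomes
              t_end - t_start
    """
    if t_start is not None:
        t1 = t_end - t_start
    if t1 is None:
        raise ValueError("either t1 or t_start must be given")
    if t1 <= 0 or t2 <= 0:
        raise ValueError("t1 and t2 must be positive")
    voice_start = t_end - t1
    return Sections(voice=(voice_start, t_end),
                    non_voice=(voice_start - t2, voice_start))
```

Calling `sections_from_detection(10.0, t2=2.0, t1=1.0)` covers the case where only the end time is known, and passing `t_start` instead of `t1` covers the case where both the start and end times are reported.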

(Feature amount calculation unit 333)
The feature amount calculation unit 333 receives the information indicating the specific sound voice section and the information indicating the non-voice section from the specific sound voice section calculation unit 332, and receives the time-series acoustic signal targeted for voice section detection that is stored in the voice storage unit 331. It associates the time-series acoustic signal with the specific sound voice section and with the non-voice section, calculates the voice section feature amount from the time-series acoustic signal corresponding to the specific sound voice section, calculates the non-voice section feature amount from the time-series acoustic signal corresponding to the non-voice section, and outputs both feature amounts. A log mel spectrum or cepstrum coefficients, for example, can be used as the feature amount. However, it is preferable to use an acoustic feature amount other than the one used by the second acoustic signal analysis unit 322 (the fundamental frequency). Any method may be used to calculate the feature amount; for example, the method described in Reference 4 is used.
(Reference 4) JP 2009-63700 A

<Voice section detection unit 320>
The voice section detection unit 320 receives the time-series acoustic signal from the microphone 310 and receives the voice section feature amount and the non-voice section feature amount from the feature amount calculation unit 333. It obtains a voice parameter indicating the feature of the voice section from the voice section feature amount, obtains a non-voice parameter indicating the feature of the non-voice section from the non-voice section feature amount, detects at least one of the voice section and the non-voice section from the time-series acoustic signal using the voice parameter and the non-voice parameter (S320), and outputs the detection result.

For example, the voice section detection unit 320 obtains, from the voice section feature amount, a voice parameter that is a parameter of the acoustic model used to estimate the voice section, and obtains, from the non-voice section feature amount, a non-voice parameter that is a parameter of the acoustic model used to estimate the non-voice section.

For example, the voice section detection device of Reference 4 can be used as the voice section detection unit 320. In this case, the voice parameter is a parameter of the voice GMM, and the non-voice parameter is a parameter of the non-voice GMM.

As shown in FIG. 17, the voice section detection unit 320 includes: a first acoustic signal analysis unit 321 that performs probability calculation on the input time-series acoustic signal using a parallel Kalman filter and a parallel Kalman smoother; a second acoustic signal analysis unit 322 that performs probability calculation using the ratio of the periodic component to the aperiodic component of the time-series acoustic signal; a weight calculation unit 323 that calculates a weight for each probability; a voice state/non-voice state composite probability ratio calculation unit 324 that uses the calculated weights to calculate the composite probability that the time-series acoustic signal belongs to the voice state and the composite probability that it belongs to the non-voice state, and obtains their ratio; and a voice section estimation unit 325 that discriminates voice from non-voice based on the voice state/non-voice state composite probability ratio. The components other than the first acoustic signal analysis unit 321 perform the same processing as in Reference 4, so their description is omitted.

The time-series acoustic signal input to the first acoustic signal analysis unit 321 is an acoustic signal that has been sampled at a sampling rate of, for example, 8,000 Hz and converted into a discrete signal. This acoustic signal is a sound in which a noise signal is superimposed on the voice signal, which is the target signal. Hereinafter, the acoustic signal is called the "input signal", the voice signal is called "clean voice", and the noise signal is called "noise".

The voice section detection unit 320 receives the input signal, the voice section feature amount, and the non-voice section feature amount, and outputs a voice section detection result. For each frame, the voice section detection result is 1 if the acoustic signal of that frame belongs to the voice state and 0 if it belongs to the non-voice state. The voice section detection unit 320 may instead output the input signal multiplied by the value of the voice section detection result. In that case, the input signal values of frames belonging to the voice state are kept, and all signal values of frames belonging to the non-voice state are replaced with 0.
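The optional masked output described above can be illustrated with a short Python sketch. It assumes non-overlapping frames of a fixed length `frame_len`, which the text does not specify; the function name `mask_by_detection` is hypothetical.

```python
from typing import List, Sequence


def mask_by_detection(signal: Sequence[float],
                      decisions: Sequence[int],
                      frame_len: int) -> List[float]:
    """Multiply each sample by the 0/1 detection result of its frame.

    Assumption (not stated in the text): frames do not overlap and each
    holds exactly frame_len samples, so sample i belongs to frame
    i // frame_len.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    if len(signal) != len(decisions) * frame_len:
        raise ValueError("signal length must equal len(decisions) * frame_len")
    # Voice frames (1) keep their samples; non-voice frames (0) become 0.
    return [x * decisions[i // frame_len] for i, x in enumerate(signal)]
```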

<First acoustic signal analysis unit 321>
As shown in FIG. 18, the first acoustic signal analysis unit 321 receives the input signal, the voice section feature amount, and the non-voice section feature amount. It includes a feature amount calculation unit 3211 that extracts the acoustic feature amount used for voice section detection, and a probability estimation unit 3212 that estimates probability model parameters and calculates the probability of the input signal using the probability model constructed from the estimated parameters.

(Feature amount calculation unit 3211)
The feature amount calculation unit 3211 calculates the feature amount of the input signal by the same method as the feature amount calculation unit 333 and outputs it. For example, it calculates and outputs a vector G_t = {g_{t,0}, ..., g_{t,φ}, ..., g_{t,23}} whose elements are a 24-dimensional log mel spectrum. The vector G_t represents the acoustic feature amount of the frame whose extraction start time is t, and φ is the element index of the vector. Hereinafter, t is called the frame time.

(Probability estimation unit 3212)
The 24-dimensional log mel spectrum output by the feature amount calculation unit 3211 is the input to the probability estimation unit 3212. The probability estimation unit 3212 applies a parallel nonlinear Kalman filter and a parallel Kalman smoother to the input frames to estimate the noise parameters. Using the estimated noise parameters, it generates probability models for non-voice (noise + silence) and voice (noise + clean voice), and calculates the probability obtained when the log mel spectrum is input to each probability model.

As shown in FIG. 19, the probability estimation unit 3212 includes a forward estimation unit 3212-1, a backward estimation unit 3212-2, a GMM (Gaussian Mixture Model) storage unit 3212-3, and a parameter storage unit 3212-4. The backward estimation unit 3212-2 performs the same processing as in Reference 4, so its description is omitted.

The GMM storage unit 3212-3 stores a silence GMM and a clean voice GMM, which are acoustic models of a silence signal and a clean voice signal prepared in advance. Hereinafter, the silence GMM and the clean voice GMM are referred to simply as GMMs. The method of constructing a GMM is a known technique, so its description is omitted. Each GMM contains a plurality of normal distributions (for example, 32), and each normal distribution is parameterized by a mixture weight w_{j,k}, a mean μ_{S,j,k,φ}, and a variance Σ_{S,j,k,φ}, where j is the GMM type (j=0: silence GMM, j=1: clean voice GMM) and k is the index of the normal distribution. These parameters are input to the forward estimation unit 3212-1 and the backward estimation unit 3212-2.
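To make the roles of the mixture weight, mean, and variance concrete, the following Python sketch evaluates the log-likelihood of one feature vector under a single GMM with a per-dimension (diagonal) variance. It is a generic illustration of a GMM only: it does not include the noise model or the Kalman filtering of Reference 4, and the function name `gmm_log_likelihood` is hypothetical.

```python
import math
from typing import Sequence


def gmm_log_likelihood(x: Sequence[float],
                       weights: Sequence[float],
                       means: Sequence[Sequence[float]],
                       variances: Sequence[Sequence[float]]) -> float:
    """Log-likelihood of x under a diagonal-covariance GMM.

    weights[k]      : mixture weight of component k
    means[k][d]     : mean of component k in dimension d
    variances[k][d] : variance of component k in dimension d
    """
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Sum of one-dimensional Gaussian log-densities over all dimensions.
        log_gauss = sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, mu, var)
        )
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over the components for numerical stability.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))
```

With 24-dimensional log mel spectra and 32 components, `x` would have 24 elements and `weights`, `means`, and `variances` would each have 32 entries.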

The parameter storage unit 3212-4 includes an initial noise model estimation buffer and a noise model estimation buffer.

[Forward estimation unit 3212-1]
The processing in the forward estimation unit 3212-1 differs from that in Reference 4.

In Reference 4, the forward estimation unit obtains the noise model parameters ^N_{t,j,k,φ} and ^Σ_{N,t,j,k,φ} by sequential updating from the start of processing, but it updates the parameters of the non-voice and voice GMMs without determining whether the input sound is voice or non-voice (noise). In the present embodiment, by contrast, the non-voice section and the voice section are known, so this information is used more actively to update the parameters. That is, the parameters of the non-voice GMM are updated using the feature amounts of the non-voice section, and the parameters of the voice GMM are updated using the feature amounts of the voice section. A processing example follows.

First, the forward estimation unit 3212-1 updates the parameters of the non-voice GMM (j=0) using the feature amounts g_{t-t_1-t_2,φ}, ..., g_{t-t_1,φ} of frame times t-t₁-t₂ through t-t₁, which correspond to the non-voice section. Here the subscripts t_1 and t_2 denote t₁ and t₂, respectively.

The forward estimation unit 3212-1 stores, in the initial noise model estimation buffer, q frames of the non-voice section feature amount (in this example, the log mel spectrum g_{t,φ}), namely g_{t-t_1-t_2,φ}, ..., g_{t-t_1-t_2+q-1,φ}. Here q is an integer of 1 or more that does not exceed the length t₂ of the non-voice section; for example, q=10.

The forward estimation unit 3212-1 then takes the q frames of feature amounts g_{t-t_1-t_2,φ}, ..., g_{t-t_1-t_2+q-1,φ} out of the initial noise model estimation buffer, estimates the initial noise model parameters N^init_φ and Σ^init_{N,φ} by the following equations, and stores them in the noise model estimation buffer.

It also updates the parameters of the non-voice GMM (j=0) using the feature amounts g_{t-t_1-t_2+q,φ}, ..., g_{t-t_1,φ} of frame times t-t₁-t₂+q through t-t₁. The update method and update equations for the non-voice GMM parameters are the same as in Reference 4.

Next, the forward estimation unit 3212-1 updates the parameters of the voice GMM (j=1) using the feature amounts g_{t-t_1+1,φ}, ..., g_{t,φ} of frame times t-t₁+1 through t, which correspond to the voice section. The parameters updated using the last frame of the non-voice section are taken as the initial parameters of the voice section. The update method and update equations for the voice GMM parameters are the same as in Reference 4.

From frame time t onward, the parameters of the voice and non-voice GMMs are updated using the feature amounts of the input signal, as in the prior art.
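The way the frames before the detection time t are divided among the three steps above can be sketched as follows. The sketch only computes frame indices; it does not perform the parameter updates of Reference 4. It assumes that t, t₁, and t₂ are expressed as integer frame counts and that the buffered frames are the first q frames of the non-voice section. The function name `partition_frames` is hypothetical.

```python
from typing import List, Tuple


def partition_frames(t: int, t1: int, t2: int,
                     q: int) -> Tuple[List[int], List[int], List[int]]:
    """Split the frame times up to t into the three groups used above.

    Returns (init, non_voice_update, voice_update):
      init             : q frames used for the initial noise model
      non_voice_update : remaining non-voice frames, up to and including t - t1
      voice_update     : voice section frames t - t1 + 1 through t
    """
    if not 1 <= q <= t2:
        raise ValueError("q must be an integer with 1 <= q <= t2")
    start = t - t1 - t2
    init = list(range(start, start + q))
    non_voice_update = list(range(start + q, t - t1 + 1))
    voice_update = list(range(t - t1 + 1, t + 1))
    return init, non_voice_update, voice_update
```

For instance, with t=100, t₁=20, t₂=30, and q=10, frames 50 to 59 initialize the noise model, frames 60 to 80 update the non-voice GMM, and frames 81 to 100 update the voice GMM.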

Starting from the non-voice GMM parameters updated using the feature amounts of the non-voice section and the voice GMM parameters updated using the feature amounts of the voice section, the voice section detection unit 320 updates the voice and non-voice GMM parameters from frame time t onward using the feature amounts of the input signal, and determines voice or non-voice using the resulting parameters. The determination accuracy can therefore be improved compared with the prior art, which updates the non-voice and voice GMM parameters without determining whether the sound is voice or non-voice (noise).

The processing described above may be performed only the first time the voice section feature amount and the non-voice section feature amount are received from the feature amount calculation unit 333, or every time they are received. When it is performed every time, all of the processing may be repeated each time, including (a) obtaining the initial noise model parameters N^init_φ and Σ^init_{N,φ} and (b) taking the parameters updated using the last frame of the non-voice section as the initial parameters of the voice section. Alternatively, from the second time onward, (a) and (b) may be skipped: the parameters held at the time the feature amounts are received are used as they are, the parameters of the non-voice GMM (j=0) are updated using the feature amounts g_{t-t_1-t_2,φ}, ..., g_{t-t_1,φ} of frame times t-t₁-t₂ through t-t₁, and the parameters of the voice GMM (j=1) are updated using the feature amounts g_{t-t_1+1,φ}, ..., g_{t,φ} of frame times t-t₁+1 through t.

<Effect>
With the above configuration, the result of keyword detection on a specific utterance of the target person (user) is used to learn more accurately about the surrounding acoustic environment, including the target voice, which makes the signal processing for voice section detection more robust. In particular, even when the voice to be recognized and the noise have similar characteristics, at least one of the voice section and the non-voice section can be detected with higher accuracy than before.

Note that the single microphone 310 and the specific voice section detection unit 340 may be part of the acoustic signal processing apparatus. In the present embodiment, a GMM is used as the acoustic model for estimating the voice section and the non-voice section, but another acoustic model such as an HMM (Hidden Markov Model) may be used. In that case as well, the voice parameter and the non-voice parameter may be obtained from the voice section feature amount and the non-voice section feature amount, respectively, as in the present embodiment.

<First Modification of Third Embodiment>
The description focuses on the differences from the third embodiment.

In the third embodiment, a log mel spectrum, cepstrum coefficients, or the like is used as the feature amount, but other feature amounts may be used. This modification considers the simpler case in which the voice level is used for the determination.

In the present embodiment, average power is used as the feature amount. The feature amount calculation unit 333 therefore calculates the average power of the time-series acoustic signal corresponding to the specific sound voice section and outputs it as the voice section feature amount, and calculates the average power of the time-series acoustic signal corresponding to the non-voice section and outputs it as the non-voice section feature amount.

<Voice section detection unit 320>
The voice section detection unit 320 receives the time-series acoustic signal targeted for voice section detection that is stored in the voice storage unit 331, and receives the voice section feature amount and the non-voice section feature amount from the feature amount calculation unit 333. It obtains a voice parameter indicating the feature of the voice section from the voice section feature amount, obtains a non-voice parameter indicating the feature of the non-voice section from the non-voice section feature amount, detects at least one of the voice section and the non-voice section from the time-series acoustic signal using the voice parameter and the non-voice parameter (S320), and outputs the detection result.

As shown in FIG. 20, the voice section detection unit 320 includes a voice power calculation unit 326, a voice/non-voice determination unit 327, a non-voice level storage unit 328, and a voice level storage unit 329.

The voice power calculation unit 326 receives the time-series acoustic signal targeted for voice section detection that is stored in the voice storage unit 331, calculates the average power P(n) of the time-series acoustic signal for each frame n, and outputs it.
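A per-frame average power P(n) could be computed as in the following Python sketch. The text does not define the framing or the exact power formula, so the sketch assumes non-overlapping frames of `frame_len` samples and takes the mean of the squared sample values as the average power; the function name `frame_average_power` is hypothetical.

```python
from typing import List, Sequence


def frame_average_power(signal: Sequence[float], frame_len: int) -> List[float]:
    """Return P(n) for each complete frame n of the signal.

    Assumptions (not from the text): frames do not overlap, average power
    is the mean of the squared samples, and a trailing partial frame is
    dropped.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(signal) // frame_len
    return [
        sum(x * x for x in signal[n * frame_len:(n + 1) * frame_len]) / frame_len
        for n in range(n_frames)
    ]
```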

For example, a method is conceivable in which a section is determined to be a voice section when
P(n) > γV and P(n) > δN
is satisfied. Here n is an index representing the frame time; N and V are the power threshold of the non-voice section and the power threshold of the voice section, stored in the non-voice level storage unit 328 and the voice level storage unit 329, respectively; γ is a real number from 0 to 1; and δ is a real number of 1 or more. A frame is determined to be a voice section when its power exceeds both a value reasonably close to the signal level of the voice section (γV) and a value sufficiently larger than the signal level of the non-voice section, for example noise (δN). With this method, detection does not work correctly when the non-voice and voice information stored in advance (V, N) differs from the actual signal levels of the voice and non-voice sections. The information (V, N) could also be updated sequentially according to the time-series acoustic signal, but because the update is performed without knowing which sections are voice or non-voice, there is a risk that the values are updated in the wrong direction.

In the present embodiment, the power thresholds V and N are changed using the voice section feature amount (the average power of the voice section) and the non-voice section feature amount (the average power of the non-voice section).

The voice/non-voice determination unit 327 retrieves the power thresholds N and V from the non-voice level storage unit 328 and the voice level storage unit 329, respectively, receives the average power P(n) from the voice power calculation unit 326, and receives from the feature amount calculation unit 333 the average power Pv of the time-series acoustic signal corresponding to the specific sound voice section and the average power Pn of the time-series acoustic signal corresponding to the non-voice section.

The voice/non-voice determination unit 327 replaces the power thresholds V and N with power thresholds V' and N' that take the average powers Pv and Pn into account, using the following equations.
N' = (1-α)N + αPn
V' = (1-β)V + βPv
Here α and β are parameters (0 < α < 1, 0 < β < 1) that determine the contribution of the detected non-voice and voice sections. The voice/non-voice determination unit 327 detects the section corresponding to frame n as a voice section when
P(n) > γV' and P(n) > δN'
is satisfied, detects it as a non-voice section otherwise, and outputs the detection result.

In the present embodiment, V' corresponds to the voice parameter indicating the feature of the voice section, and N' corresponds to the non-voice parameter indicating the feature of the non-voice section.
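The threshold update and the per-frame decision above can be written directly in Python. The function names `update_thresholds` and `is_voice` are hypothetical, and the parameter values in the usage note are arbitrary illustrative numbers, not values given in the text.

```python
from typing import Tuple


def update_thresholds(V: float, N: float, Pv: float, Pn: float,
                      alpha: float, beta: float) -> Tuple[float, float]:
    """Return (V', N') with V' = (1-beta)V + beta*Pv, N' = (1-alpha)N + alpha*Pn."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    V_new = (1.0 - beta) * V + beta * Pv
    N_new = (1.0 - alpha) * N + alpha * Pn
    return V_new, N_new


def is_voice(P_n: float, V_new: float, N_new: float,
             gamma: float, delta: float) -> bool:
    """Frame n is a voice section iff P(n) > gamma*V' and P(n) > delta*N'."""
    return P_n > gamma * V_new and P_n > delta * N_new
```

For example, with V=4.0, N=1.0, Pv=8.0, Pn=2.0, α=0.5, and β=0.25, the updated thresholds are V'=5.0 and N'=1.5. With γ=0.5 and δ=2.0, a frame with P(n)=4.0 is classified as voice, whereas a frame with P(n)=2.75 exceeds γV'=2.5 but not δN'=3.0 and is classified as non-voice.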

<Effect>
With the above configuration, the level determination can be made to match the actual conditions more closely, and the same effect as in the third embodiment can be obtained.

<Second Modification of Third Embodiment>
The description focuses on the differences from the third embodiment.

FIG. 13 is a functional block diagram of the acoustic signal processing apparatus according to the third embodiment, and FIG. 14 shows its processing flow.

The acoustic signal processing apparatus includes a voice section detection unit 320, a voice section detection information storage unit 330, and a preprocessing unit 350.

<Preprocessing unit 350>
The preprocessing unit 350 receives the time-series acoustic signal as input, performs processing that enhances the voice contained in the time-series acoustic signal (voice enhancement processing) (S350), and outputs the enhanced time-series acoustic signal. Any method may be used for the voice enhancement processing; for example, the noise suppression method described in Reference 2 is used.
(Reference 2) JP 2009-110011 A

<Effect>
With the above configuration, the same effect as in the third embodiment can be obtained. In addition, performing the subsequent processing (S330, S320) on the time-series acoustic signal that has undergone voice enhancement processing improves the detection accuracy.

<Third Modification of Third Embodiment>
The description focuses on the differences from the third embodiment.

The acoustic signal processing apparatus receives as input M time-series acoustic signals picked up by M microphones 310-m (m=1,2,...,M, where M is an integer of 2 or more) and L output values of the specific voice section detection unit 340 (where L is an integer of 2 or more), detects at least one of the voice section and the non-voice section contained in the time-series acoustic signals, and outputs the detection result.

FIG. 21 is a functional block diagram of the acoustic signal processing apparatus according to the third modification, and FIG. 22 shows its processing flow.

The acoustic signal processing apparatus includes a beamforming unit 360, a voice section detection unit 320, and a voice section detection information storage unit 330.

<Beamforming unit 360>
The beamforming unit 360 receives the M time-series acoustic signals as input, converts them into L time-series signals (time-series acoustic signals, for example beamforming output signals) whose directivity is enhanced in L respective directions (S360), and outputs them to the specific voice section detection unit 340, the voice section detection information storage unit 330, and the voice section detection unit 320. For example, a beamforming technique is used to convert the input into L time-series beamforming output signals. Any beamforming method may be used; for example, the method described in Reference 3 is used.
(Reference 3) JP 2017-107141 A
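Since any beamforming method may be used, the conversion from M microphone signals to L directional signals can be illustrated with a simple delay-and-sum beamformer. This is a textbook technique substituted here for illustration; it is not the method of Reference 3. The sketch assumes integer sample delays that are already known for each direction, and the function name `delay_and_sum` is hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays_per_direction: Sequence[Sequence[int]]) -> List[List[float]]:
    """Convert M channel signals into L delay-and-sum outputs.

    channels[m]                : samples of microphone m (equal lengths)
    delays_per_direction[l][m] : assumed integer delay, in samples, applied
                                 to microphone m when steering to direction l
    """
    n = len(channels[0])
    outputs = []
    for delays in delays_per_direction:
        out = [0.0] * n
        for sig, d in zip(channels, delays):
            for i in range(n):
                j = i - d  # sample of this channel that aligns with output i
                if 0 <= j < n:
                    out[i] += sig[j]
        # Average over the microphones so the output level stays comparable.
        outputs.append([v / len(channels) for v in out])
    return outputs
```

Each of the L outputs would then be passed on to the detection and storage units as one channel.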

The specific voice section detection unit 340 detects, for each of the L time-series signals, that the specific sound has arrived, and outputs information indicating the detection time of the specific sound to the voice section detection information storage unit 330. The specific sound is assumed to be detected in at least one of the L time-series signals, and the information indicating the detection time contains information indicating the one or more channels in which it was detected and information indicating the one or more detection times corresponding to those channels. The information indicating each detection time is as described in the third embodiment.

<Speech section detection information storage unit 330>
The speech section detection information storage unit 330 receives the information indicating the detection time of the specific sound and the L time-series signals as input, obtains the speech section feature amount and the non-speech section feature amount of each channel in which the specific sound was detected (S330), and outputs them. The feature amounts are obtained for all channels in which the specific sound was detected.
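Step S330 can be sketched as below, assuming the section rule of claim 2 (the t1 seconds before the detection time form the specific-sound speech section, and the t2 seconds before that form the non-speech section) and average power as the feature amount, as in claim 4. The function names, the sample-rate argument, and the rounding of section boundaries are assumptions made for this example.

```python
from typing import Dict, Sequence, Tuple


def average_power(samples: Sequence[float]) -> float:
    """Mean of squared samples; 0.0 for an empty section."""
    return sum(x * x for x in samples) / len(samples) if samples else 0.0


def section_features(signal: Sequence[float], detection_time: float,
                     t1: float, t2: float, sample_rate: int) -> Tuple[float, float]:
    """Return (Pv, Pn): average power of the speech and non-speech sections."""
    end = int(round(detection_time * sample_rate))
    speech_start = max(0, end - int(round(t1 * sample_rate)))
    noise_start = max(0, speech_start - int(round(t2 * sample_rate)))
    pv = average_power(signal[speech_start:end])
    pn = average_power(signal[noise_start:speech_start])
    return pv, pn


def features_for_detected_channels(signals: Sequence[Sequence[float]],
                                   detection_times: Dict[int, float],
                                   t1: float, t2: float,
                                   sample_rate: int) -> Dict[int, Tuple[float, float]]:
    """Compute features only for channels in which the specific sound was detected."""
    return {ch: section_features(signals[ch], t, t1, t2, sample_rate)
            for ch, t in detection_times.items()}


# Channel 0: one second of quiet followed by one second of loud signal (4 Hz toy rate).
signals = [[0.5] * 4 + [2.0] * 4, [0.0] * 8]
features = features_for_detected_channels(signals, {0: 2.0}, t1=1.0, t2=1.0, sample_rate=4)
```

Channel 1 receives no entry because the specific sound was not detected there.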

<Speech section detection unit 320>
The speech section detection unit 320 receives the L time-series signals, and receives from the feature amount calculation unit 333 the speech section feature amount and the non-speech section feature amount of each channel in which the specific sound was detected. The speech section detection unit 320 obtains one speech parameter indicating the features of the speech section from the speech section feature amounts of all channels in which the specific sound was detected, obtains one non-speech parameter indicating the features of the non-speech section from the non-speech section feature amounts of all those channels, detects at least one of the speech section and the non-speech section from each of the L time-series signals using the speech parameter and the non-speech parameter (S320), and outputs the detection result. The detection method is as described in the third embodiment. In this modification, one (common) speech parameter and one (common) non-speech parameter are used for the L time-series signals.
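The use of one common parameter pair can be sketched as follows. The embodiment does not fix how the per-channel feature amounts are combined or how thresholds are derived from them, so averaging across detected channels, the scale factors `alpha` and `beta`, and the three-way labelling are all assumptions of this example.

```python
from typing import Dict, List, Sequence, Tuple


def shared_parameters(features: Dict[int, Tuple[float, float]],
                      alpha: float = 0.5, beta: float = 2.0) -> Tuple[float, float]:
    """Derive one (speech, non-speech) threshold pair from all detected channels.

    features maps channel -> (Pv, Pn). Pooling by simple averaging and the
    factors alpha and beta are assumed, not taken from the embodiment.
    """
    pvs = [pv for pv, _ in features.values()]
    pns = [pn for _, pn in features.values()]
    mean_pv = sum(pvs) / len(pvs)
    mean_pn = sum(pns) / len(pns)
    return alpha * mean_pv, beta * mean_pn


def classify_frames(frame_powers: Sequence[float],
                    speech_threshold: float,
                    non_speech_threshold: float) -> List[str]:
    """Label each frame by comparing its average power with both thresholds."""
    labels = []
    for power in frame_powers:
        if power >= speech_threshold:
            labels.append("speech")
        elif power <= non_speech_threshold:
            labels.append("non-speech")
        else:
            labels.append("undetermined")
    return labels


# Features from the two channels in which the specific sound was detected.
speech_thr, non_speech_thr = shared_parameters({0: (4.0, 0.25), 2: (2.0, 0.75)})
# The same pair of thresholds is then applied to every one of the L channels.
labels = classify_frames([3.0, 1.2, 0.2], speech_thr, non_speech_thr)
```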

<Effect>
With this configuration, the same effect as that of the third embodiment can be obtained. The beamforming unit 360 may be provided as a separate device, in which case the acoustic signal processing device takes the L time-series signals as input. Alternatively, the beamforming unit 360 may be omitted, and the input may be L time-series acoustic signals picked up by L directional microphones 310-m (m = 1, 2, …, L, where L is an integer of 2 or more), each with directivity enhanced in one of the L directions.

<Fourth modification of the third embodiment>
The description focuses on the differences from the third modification.

<Speech section detection unit 320>
The speech section detection unit 320 receives the L time-series signals, and receives from the feature amount calculation unit 333 the speech section feature amount and the non-speech section feature amount of each channel in which the specific sound was detected. The speech section detection unit 320 obtains one speech parameter indicating the features of the speech section from the speech section feature amount of one channel in which the specific sound was detected, and obtains one non-speech parameter indicating the features of the non-speech section from the non-speech section feature amount of that channel. Using the speech parameter and the non-speech parameter obtained in this way for each channel in which the specific sound was detected, it detects at least one of the speech section and the non-speech section from the time-series acoustic signal in which the specific sound was detected (S320), and outputs the detection result. The detection method is as described in the third embodiment.

In this modification, L speech parameters and L non-speech parameters, corresponding respectively to the L time-series signals, are used. The speech section detection unit 320 receives the speech section feature amount and the non-speech section feature amount of a channel in which the specific sound was detected, and obtains the speech parameter and the non-speech parameter only for that channel. For a channel in which the specific sound has not been detected, no speech parameter or non-speech parameter is obtained; the parameters for that channel are obtained at the time the specific sound is detected in it.
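The per-channel handling of this modification can be sketched as a small parameter store that is updated only when the specific sound is detected on a channel. The class name, the threshold rule (`alpha` and `beta` scale factors applied to the average powers Pv and Pn), and the choice to return `None` for a channel without parameters are assumptions of this example.

```python
from typing import Dict, Optional, Tuple


class PerChannelParameters:
    """Keeps one (speech, non-speech) threshold pair per channel (hypothetical helper)."""

    def __init__(self, alpha: float = 0.5, beta: float = 2.0) -> None:
        self.alpha = alpha  # assumed scale applied to Pv
        self.beta = beta    # assumed scale applied to Pn
        self.params: Dict[int, Tuple[float, float]] = {}

    def update(self, channel: int, pv: float, pn: float) -> None:
        """Called only for a channel in which the specific sound was detected."""
        self.params[channel] = (self.alpha * pv, self.beta * pn)

    def classify(self, channel: int, frame_power: float) -> Optional[str]:
        """Label one frame, or return None if the channel has no parameters yet."""
        if channel not in self.params:
            return None
        speech_thr, non_speech_thr = self.params[channel]
        if frame_power >= speech_thr:
            return "speech"
        if frame_power <= non_speech_thr:
            return "non-speech"
        return "undetermined"


store = PerChannelParameters()
# The specific sound was detected on channel 1 only: thresholds become 2.0 and 0.5.
store.update(channel=1, pv=4.0, pn=0.25)
```

Here `store.classify(1, 3.0)` uses the thresholds of channel 1, while `store.classify(0, 3.0)` returns `None` until the specific sound is detected on channel 0.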

<Effect>
With this configuration, the same effect as that of the third embodiment can be obtained, and detailed speech parameters and non-speech parameters can be obtained for each channel.

[Supplement]
The acoustic signal processing device can be said to include an acoustic signal processing unit that takes as input an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal obtained by removing the acoustic signal corresponding to the specific sound from the input acoustic signal as a noise acoustic signal, and performs acoustic signal processing that associates the noise acoustic signal with the acoustic signal corresponding to the specific sound.

Alternatively, the acoustic signal processing device can be said to include an acoustic signal processing unit that takes as input an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal corresponding to the specific sound as a target acoustic signal, and performs acoustic signal processing that associates the target acoustic signal with the acoustic signal obtained by removing the target acoustic signal from the input acoustic signal.

Alternatively, the acoustic signal processing device can be said to include an acoustic signal processing unit that takes as input an acoustic signal including a specific sound, which is a predetermined sound, treats the acoustic signal obtained by removing the acoustic signal corresponding to the specific sound from the input acoustic signal as a noise acoustic signal, treats the acoustic signal corresponding to the specific sound as a target acoustic signal, and performs acoustic signal processing that associates the target acoustic signal with the noise acoustic signal.

One example of the acoustic signal processing unit is the third directional sound collecting unit 52 of Modification 4 of the first embodiment. In this case, the target acoustic signal is the signal of sound from the arrival direction read from the direction storage unit 13, and the noise acoustic signal is the signal of sound from the arrival direction read from the noise direction storage unit 51.

Another example of the acoustic signal processing unit is the combination of the speech section detection information storage unit 330 and the speech section detection unit 320 of the third embodiment. In this case, the target acoustic signal is the time-series acoustic signal corresponding to the specific-sound speech section, and the noise acoustic signal is the time-series acoustic signal corresponding to the non-speech section.

[Program and recording medium]
When the processing in each unit of each acoustic signal processing device is implemented by a computer, the processing content of the functions that each unit should have is described by a program. Executing this program on the computer implements the processing of each unit on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The processing of each unit may be configured by executing a predetermined program on a computer, or at least part of the processing may be implemented in hardware.

Needless to say, other modifications may be made as appropriate without departing from the spirit of the present invention.

Claims

An acoustic signal processing device that takes as input a time-series acoustic signal including a specific sound, which is a predetermined sound uttered by a person, and information indicating a detection time of the specific sound included in the time-series acoustic signal, the device comprising:
a specific-sound speech section calculation unit that determines, as a specific-sound speech section, a section of the time-series acoustic signal estimated on the basis of the detection time to correspond to the specific sound, and determines, as a non-speech section, a section of the time-series acoustic signal estimated on the basis of the detection time not to correspond to the specific sound;
a feature amount calculation unit that calculates a speech section feature amount from the time-series acoustic signal corresponding to the specific-sound speech section, and calculates a non-speech section feature amount from the time-series acoustic signal corresponding to the non-speech section; and
a speech section detection unit that obtains a speech parameter indicating features of a speech section from the speech section feature amount, obtains a non-speech parameter indicating features of a non-speech section from the non-speech section feature amount, and detects at least one of a speech section and a non-speech section from the time-series acoustic signal using the speech parameter and the non-speech parameter.

The acoustic signal processing device according to claim 1, wherein
the specific-sound speech section calculation unit determines the t1 seconds preceding the detection time to be the specific-sound speech section, and determines the t2 seconds preceding the specific-sound speech section to be the non-speech section.

The acoustic signal processing device according to claim 1 or 2, wherein
the speech section detection unit obtains, from the speech section feature amount, the speech parameter, which is a parameter of an acoustic model used when estimating a speech section, and obtains, from the non-speech section feature amount, the non-speech parameter, which is a parameter of an acoustic model used when estimating a non-speech section.

The acoustic signal processing device according to claim 1 or 2, wherein
the feature amount is an average power, the speech parameter is a power threshold that takes into account the average power Pv of the time-series acoustic signal corresponding to the specific-sound speech section, and the non-speech parameter is a power threshold that takes into account the average power Pn of the time-series acoustic signal corresponding to the non-speech section, and
the speech section detection unit detects at least one of a speech section and a non-speech section on the basis of the magnitude relationship between the speech parameter and the average power of the time-series acoustic signal and the magnitude relationship between the non-speech parameter and the average power of the time-series acoustic signal.

The acoustic signal processing device according to any one of claims 1 to 4,
further comprising a pre-processing unit that emphasizes speech included in the time-series acoustic signal.

The acoustic signal processing device according to any one of claims 1 to 5,
which takes as input L time-series acoustic signals, each with directivity enhanced in one of L directions and including, in at least one channel, a specific sound that is a predetermined sound, and information indicating the detection time of the specific sound included in the time-series acoustic signals, wherein
the feature amount calculation unit calculates the speech section feature amount from the time-series acoustic signal corresponding to the specific-sound speech section of each channel in which the specific sound was detected, and calculates the non-speech section feature amount from the time-series acoustic signal corresponding to the non-speech section of that channel, and
the speech section detection unit obtains one speech parameter indicating features of a speech section from the speech section feature amounts of all channels in which the specific sound was detected, obtains one non-speech parameter indicating features of a non-speech section from the non-speech section feature amounts of all channels in which the specific sound was detected, and detects at least one of a speech section and a non-speech section from the time-series acoustic signals using the speech parameter and the non-speech parameter.

The acoustic signal processing device according to any one of claims 1 to 5,
which takes as input L time-series acoustic signals, each with directivity enhanced in one of L directions and including, in at least one channel, a specific sound that is a predetermined sound, and information indicating the detection time of the specific sound included in the time-series acoustic signals, wherein
the feature amount calculation unit calculates the speech section feature amount from the time-series acoustic signal corresponding to the specific-sound speech section of each channel in which the specific sound was detected, and calculates the non-speech section feature amount from the time-series acoustic signal corresponding to the non-speech section of that channel, and
the speech section detection unit obtains one speech parameter indicating features of a speech section from the speech section feature amount of one channel in which the specific sound was detected, obtains one non-speech parameter indicating features of a non-speech section from the non-speech section feature amount of that channel, and detects at least one of a speech section and a non-speech section from the time-series acoustic signal of the channel in which the specific sound was detected, using the speech parameter and the non-speech parameter of that channel.

An acoustic signal processing method in which an acoustic signal processing device takes as input a time-series acoustic signal including a specific sound, which is a predetermined sound uttered by a person, and information indicating a detection time of the specific sound included in the time-series acoustic signal, the method comprising:
a specific-sound speech section calculation step of determining, as a specific-sound speech section, a section of the time-series acoustic signal estimated on the basis of the detection time to correspond to the specific sound, and determining, as a non-speech section, a section of the time-series acoustic signal estimated on the basis of the detection time not to correspond to the specific sound;
a feature amount calculation step of calculating a speech section feature amount from the time-series acoustic signal corresponding to the specific-sound speech section, and calculating a non-speech section feature amount from the time-series acoustic signal corresponding to the non-speech section; and
a speech section detection step of obtaining a speech parameter indicating features of a speech section from the speech section feature amount, obtaining a non-speech parameter indicating features of a non-speech section from the non-speech section feature amount, and detecting at least one of a speech section and a non-speech section from the time-series acoustic signal using the speech parameter and the non-speech parameter.

A program for causing a computer to function as the acoustic signal processing apparatus according to claim 1.