JP2012127701A

JP2012127701A - Device and method for sound detection

Info

Publication number: JP2012127701A
Application number: JP2010277461A
Authority: JP
Inventors: Akira Saso; 晃佐宗; Yasutaka Tanaka; 康貴田中; Shinichi Tanaka; 伸一田中; Masumi Tanimoto; 益巳谷本
Original assignee: National Institute of Advanced Industrial Science and Technology AIST; Sohgo Security Services Co Ltd
Current assignee: National Institute of Advanced Industrial Science and Technology AIST; Sohgo Security Services Co Ltd
Priority date: 2010-12-13
Filing date: 2010-12-13
Publication date: 2012-07-05
Anticipated expiration: 2030-12-13
Also published as: JP5652945B2

Abstract

PROBLEM TO BE SOLVED: To detect a specific sound easily and with high accuracy in a noisy environment. SOLUTION: A signal power calculation unit and a slope calculation unit calculate, in time series, a feature value indicating the characteristics of the sound data of collected observation sounds. Separately, the expected value of the feature value is calculated in time series, as score parameters, from a plurality of learning data items that are of the same kind as the sound to be detected but differ from one another. A score calculation unit calculates a score for evaluating the sound data based on the difference between the feature value calculated from the sound data and the expected value of the feature value calculated from the learning data. An occurrence interval detection unit detects the positions of the local maximum and minimum values of the score and, based on those positions, detects the specific sound occurrence interval in the sound data of the observation sounds.

Description

The present invention relates to a sound detection device and a sound detection method for detecting a specific type of sound.

Conventionally, in security services, the occurrence of an abnormal situation has been detected by paying attention to specific sounds. For example, when an abnormal sound such as the sound of glass breaking is detected in the monitoring area, it can be determined that an abnormal situation has occurred. When a sound is detected that cannot be conclusively identified as abnormal but is nevertheless suspicious, it is necessary to determine whether an abnormal situation has occurred. To detect such specific sounds automatically, it is necessary to detect, from all the observation sounds observed in the monitoring area, either the abnormal or suspicious sound itself or the interval in which the abnormal or suspicious sound occurs. In the following, unless otherwise specified, abnormal sounds and suspicious sounds are collectively referred to as suspicious sounds.

Techniques for detecting a specific sound from sounds that include environmental sounds have been proposed in the past. For example, Patent Document 1 discloses a technique for detecting the interval in which speech occurs using the signal power of the sound. In Patent Document 1, the occurrence interval of a predetermined sound is detected by setting an appropriate threshold on the signal power. As another example, Patent Document 2 discloses a method of detecting the interval of a specific sound using the number of zero crossings of the sound signal.

As a further example, a method used particularly for speech is known in which the sound is divided into a plurality of frequency bands and a threshold is set on the signal power obtained in each band, thereby detecting the occurrence interval of a predetermined sound. This method is suitable for extracting sounds whose characteristic frequency band is known in advance, such as the human voice.

Furthermore, for detecting suspicious sounds, a method is also conceivable in which the occurrence interval is not detected and speech recognition processing is instead applied to all sounds collected during a certain period. In this method, for example, the user specifies the start and end points of the speech recognition processing for the collected sound signal. The processing device applies speech recognition processing to the portion of the transferred sound signal between the start and end points specified by the user, and the user detects suspicious sounds using the result of the speech recognition processing.

Patent Document 1: Japanese Patent No. 2521425
Patent Document 2: Japanese Patent No. 2944098

When the technique of Patent Document 1 described above is used to detect suspicious sounds, if noise (for example, construction noise) is superimposed on the collected environmental noise and sound signal, the signal power of the noise is superimposed on the signal power of the suspicious sound. In this case, there is a problem that the suspicious sound may go undetected or that its occurrence interval may not be detected correctly. As for the technique of Patent Document 2, the number of zero crossings is easily affected by noise, so when noise is superimposed on the environmental sound, it becomes difficult to capture the characteristics of the suspicious sound appropriately.

Furthermore, the method of dividing a sound into a plurality of frequency bands is effective for detecting the occurrence interval of a predetermined sound in a noisy environment, provided that the signal power of the sound to be detected is concentrated in a specific frequency band. However, a suspicious sound is in practice a physical (non-speech) sound, and its signal power is not necessarily concentrated in a specific frequency band, so this method is not always effective for detecting the interval of a suspicious sound.

Moreover, the method of applying speech recognition processing to all sounds collected during a certain period has the problem that the speech recognition processing itself does not detect abnormal or suspicious sounds. In this method, for example, the user observes changes in the parameters obtained as a result of the speech recognition processing in order to detect a specific physical sound such as a suspicious sound.

In addition, when this method is applied to a monitoring device, speech recognition processing must be applied to all observation sounds being monitored over a long period, for example 8 to 10 hours, which is not realistic in view of the computational cost of speech recognition processing.

To address this, the computational cost could be reduced by installing a single processing device (server) for a plurality of monitoring targets. Even in this case, however, the observation sound data collected at each of the monitoring points must be transferred to the server continuously, which is not realistic in terms of communication cost.

The present invention has been made in view of the above, and its object is to detect a specific sound easily and with high accuracy even in a noisy environment.

To solve the problems described above and achieve the object, the present invention comprises: feature value calculation means for calculating, in time series, a feature value indicating the characteristics of sound data; score calculation means for calculating a score for evaluating the sound data based on the difference between the expected value of the feature value, obtained in advance in time series from learning data, and the feature value of the signal power time series of the sound data calculated by the feature value calculation means; and detection means for detecting the position of a local maximum and the position of a local minimum of the score and detecting a specific sound occurrence interval in the sound data based on the positions of the local maximum and the local minimum.

The present invention also comprises: a feature value calculation step in which feature value calculation means calculates, in time series, a feature value indicating the characteristics of sound data; a score calculation step in which score calculation means calculates a score for evaluating the sound data based on the difference between the expected value of the feature value, obtained in advance in time series from learning data, and the feature value of the signal power time series of the sound data calculated in the feature value calculation step; and a detection step of detecting the position of a local maximum and the position of a local minimum of the score and detecting a specific sound occurrence interval in the sound data based on the positions of the local maximum and the local minimum.

According to the present invention, a specific sound can be detected easily and with high accuracy even in a noisy environment.

FIG. 1 is a block diagram schematically showing an example configuration of a sound detection device applicable to an embodiment of the present invention.
FIG. 2 is an example functional block diagram for explaining the functions of the sound detection device in more detail.
FIG. 3 is a schematic diagram for explaining the method of calculating the score parameters in more detail.
FIG. 4 is a set of histograms showing an example of the distribution of the slope of the signal power time series for each frame when acoustic data of glass impact sounds is used as the learning data.
FIG. 5 is a schematic diagram showing an example of the expected value μ and variance σ² of each frame, calculated from the distribution of the slope of the signal power time series for each frame.
FIG. 6 is a schematic diagram showing an example of the signal power y_LP(i), the slope of the signal power time series y_GLP(i), and the score S(i) obtained for each frame i of the input acoustic data.
FIG. 7 is a schematic diagram showing an example of the expected value μ_k and variance σ²_k obtained for each frame k of the learning data.
FIG. 8 is a graph in which the signal power, the slope of the signal power time series, and the score are plotted against the frame number.
FIG. 9 is a schematic diagram showing an example of a suspicious sound occurrence interval when the observation sound contains little noise.
FIG. 10 is a schematic diagram showing an example of a suspicious sound occurrence interval when the observation sound contains a lot of noise.
FIG. 11 is a schematic diagram for explaining the delay of the calculated score.
FIG. 12 is a schematic diagram for explaining the correction of the score delay.
FIG. 13 is an example flowchart showing the suspicious sound occurrence interval detection processing according to the present embodiment.
FIG. 14 is a schematic diagram showing an example of the signal power, the slope of the signal power time series, and the score when substantially random acoustic data is input against learning data based on physical sounds.
FIG. 15 is a graph in which the signal power, the slope of the signal power time series, and the score are plotted against the frame number when substantially random acoustic data is input against learning data based on physical sounds.

An embodiment of the sound detection device according to the present invention is described in detail below with reference to the accompanying drawings. In the embodiment, sounds in the monitoring area are observed, and the occurrence interval of a specific sound regarded as a suspicious or abnormal sound is detected from the acoustic signal of the observation sound. The acoustic signal of the detected specific sound occurrence interval is then cut out from the acoustic signal of the observation sound and output.

The specific sound to be detected is a physical sound, as distinct from speech uttered by a person, and resembles the environmental sounds contained in the observation sound. For this reason, in outline, the present embodiment obtains in advance the expected value of a feature value from learding data placeholder

The terms observation sound, environmental sound, suspicious sound, and abnormal sound are defined as follows. Observation sound refers to all sounds collected in the monitoring area. An abnormal sound is a sound caused by intrusion or similar behavior for which an alarm should be issued. A typical example is the breaking sound produced when glass or a similar material is broken. A suspicious sound is a questionable sound collected in the monitoring area that cannot be conclusively identified as an abnormal sound. An impact sound is one example. Environmental sound refers to any sound contained in the observation sound other than suspicious and abnormal sounds. Examples include sounds caused by natural phenomena such as wind, and the sounds of automobiles and trains.

FIG. 1 schematically shows an example configuration of a sound detection device applicable to the embodiment of the present invention. In FIG. 1, the sound detection device 100 includes an A/D conversion unit 11, a calculation unit 12, and a storage unit 13. Observation sound collected by, for example, a microphone 10 in the monitoring area is converted into digital data by the A/D conversion unit 11 and supplied to the calculation unit 12 as input acoustic data 20.

The calculation unit 12 includes, for example, a CPU (Central Processing Unit), a microprocessor, or a DSP (Digital Signal Processor). The storage unit 13 consists of, for example, a semiconductor memory or an HDD (hard disk drive). It stores the input acoustic data 20 and also stores in advance the score parameters, created from learning data, that are used to calculate the score for evaluating the input acoustic data 20. The learning data consists of acoustic data of a plurality of different sounds of the same type as the suspicious or abnormal sound to be detected. The storage unit 13 can also be used as a work area for the calculation unit 12.

The calculation unit 12 calculates the feature value of the input acoustic data 20 in time series, evaluates the calculated time-series feature values using the score parameters stored in the storage unit 13, and detects the occurrence interval of the suspicious or abnormal sound to be detected. When the calculation unit 12 detects such an occurrence interval in the input acoustic data 20, it cuts the detected interval out of the input acoustic data 20 and outputs it as output acoustic data 21 for the suspicious sound occurrence interval. The output acoustic data 21 is transmitted, for example, to a monitoring server via a communication network.

In the present embodiment, the time-series slope of the signal power of the acoustic data is used as the feature value of the acoustic data. The feature value is not limited to this example, and any other value that indicates a characteristic of the acoustic data may be used. For example, the signal power of the acoustic data itself, or the number of zero crossings in a predetermined interval, may be used as the feature value.

FIG. 2 is an example functional block diagram for explaining the functions of the sound detection device 100 in more detail. In FIG. 2, parts in common with FIG. 1 are given the same reference numerals, and their detailed description is omitted. In the present embodiment, both the suspicious sounds and the abnormal sounds described above are detection targets. Therefore, unless otherwise specified, suspicious sounds and abnormal sounds are collectively described below as suspicious sounds.

In FIG. 2, the signal power calculation unit 101, the slope calculation unit 102, the score calculation unit 103, and the occurrence interval detection unit 104 are included in the calculation unit 12. The score parameter 111 is created in advance from the learning data and stored in the storage unit 13. The storage unit 110 is, for example, an area within the storage unit 13 described above. It temporarily stores the input acoustic data 20, obtained by converting the observation sound into digital data in the A/D conversion unit 11, and also stores data used for detecting the suspicious sound occurrence interval, such as the score and the delay time correction amount described later.

The signal power calculation unit 101 calculates the signal power of the input acoustic data 20 supplied from the A/D conversion unit 11 in time series. More specifically, the signal power calculation unit 101 calculates the signal power in units of a predetermined number of consecutive samples in the time series of the input acoustic data 20. The unit over which the signal power is calculated is called a frame, and the number of samples contained in a frame is the frame width.

The signal power y_LP(i) of the i-th frame of the input acoustic data 20 is calculated, for example, by equation (1), where W is the frame width and x(n) is the value of the n-th waveform data item (sample value) in the frame.

The frames over which the signal power y_LP(i) is calculated are set by shifting the samples used by a frame interval D, a predetermined number of samples measured from the head of the frame, so that each frame shares some samples with the preceding frame. As an example, with a frame width W = 160 samples, if the m-th frame consists of samples 1 to 160, then the (m+1)-th frame consists of samples 81 to 240 and the (m+2)-th frame consists of samples 161 to 320. In this case, the frame interval is D = 80 samples. In this example the length of the overlap between frames is W/2, but the overlap is not limited to this value.
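The framing described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, indices are 0-based, and because equation (1) is not reproduced in this text, the mean of the squared samples is used as an assumed definition of signal power (a logarithmic power would be equally plausible).

```python
def frame_bounds(num_samples, width=160, step=80):
    """Return 0-based (start, end) pairs, end exclusive, for overlapping frames.

    width corresponds to the frame width W and step to the frame interval D.
    """
    return [(s, s + width) for s in range(0, num_samples - width + 1, step)]


def frame_power(samples, width=160, step=80):
    """Signal power per frame.

    ASSUMPTION: power is the mean of the squared samples in the frame.
    The source's equation (1) is not available, so this is illustrative only.
    """
    powers = []
    for start, end in frame_bounds(len(samples), width, step):
        powers.append(sum(x * x for x in samples[start:end]) / width)
    return powers


# With W = 160 and D = 80, 320 samples yield three half-overlapping frames,
# matching samples 1-160, 81-240 and 161-320 in the 1-based numbering above.
print(frame_bounds(320))  # [(0, 160), (80, 240), (160, 320)]
```

Keeping the frame interval separate from the frame width makes it easy to try overlaps other than W/2, which the text explicitly allows.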

The slope calculation unit 102 calculates the time-series slope y_GLP(i) of the per-frame signal power y_LP(i) calculated by the signal power calculation unit 101 (referred to as the slope of the signal power time series). The slope y_GLP(i) of frame i can be calculated, for example, by equation (2) using the signal power of four frames: frame i itself and the frames (i-4), (i-3), and (i-1), whose signal power has already been calculated, all of which lie within the span reaching back four frames from the target frame i.

In equation (2), the slope y_GLP(i) is calculated using the signal power of four frames starting four frames before the target frame i, but the calculation is not limited to this example. How many frames before the target frame i the data reaches back is determined by the form of the equation used to obtain the slope y_GLP(i) of the signal power time series. The values of the constants are likewise not limited to those used in this example.
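As an illustration of a slope computed from frames i, (i-1), (i-3), and (i-4), the sketch below uses the standard five-point least-squares slope centered on frame (i-2). Its center weight is zero, so only four frames contribute, which is consistent with the description. The weights and the divisor 10 are an assumption, since equation (2) is not reproduced in this text, and the function name is hypothetical.

```python
def power_slope(powers, i):
    """Slope of the signal power time series at frame i (0-based index).

    ASSUMPTION: five-point regression slope centered on frame i-2 with
    weights (-2, -1, 0, +1, +2) / 10. The patent's equation (2) may use
    different constants.
    """
    if i < 4:
        raise ValueError("frame i needs four earlier frames")
    return (2.0 * powers[i] + powers[i - 1]
            - powers[i - 3] - 2.0 * powers[i - 4]) / 10.0


# A power sequence that rises by 3 per frame has slope 3 everywhere.
ramp = [3.0 * k for k in range(10)]
print(power_slope(ramp, 6))  # 3.0
```

Because the estimate depends on frames up to four positions back, the first four frames of a recording have no slope value under this scheme.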

The score calculation unit 103 calculates a score for evaluating the input acoustic data 20 based on the slope y_GLP(i) of the signal power time series calculated for the input acoustic data 20 by the slope calculation unit 102 and on the score parameter 111, which is created in advance using learning data and stored, for example, in the storage unit 13. The score parameter 111 contains the expected value for the suspicious sound to be detected, created from learning data prepared in advance. The suspicious sound occurrence interval is then detected based on the time-series change of the calculated score.

The score parameter 111 is created as follows. First, a plurality of acoustic data items that are of the same type as the suspicious sound to be detected but differ from one another are prepared as learning data. For example, when the suspicious sound to be detected is the sound of glass breaking, acoustic data recorded when glass is broken under various conditions, such as different sizes, thicknesses, and materials, is used as learning data.

For each learning data item, the signal power is obtained for each frame as described above, and the slope of the signal power time series is calculated from the obtained signal power. Then, across the plurality of learning data items, the expected value μ_k and the variance σ²_k of the slope of the signal power time series are calculated for each set of mutually corresponding frames k. The calculated expected value μ_k and variance σ²_k are stored in the storage unit 13 as the score parameter 111 for frame k.

The method of calculating the score parameter 111 is described in more detail with reference to FIG. 3. First, frames are set on the learning data. The frame width W and the frame interval D are the same as those set for the input acoustic data 20 by the signal power calculation unit 101 described above. The slope of the signal power time series is calculated according to equation (2) above, using the signal power values of the frames reaching back four frames from the target frame i.

In the following, the first frame (frame #1) is defined as the earliest frame reached when going back, from the frame containing the start position of the suspicious sound occurrence interval, by the number of frames needed to calculate the slope of the signal power time series for that frame.

A frame is set at the rise time of the waveform of the learning data, that is, at the start position of the suspicious sound occurrence interval (frame #5 in the example of FIG. 3). From frame #5, frames are then set successively at the frame interval D in the direction in which the waveform of the learning data decays, that is, toward the end position of the suspicious sound occurrence interval (frames #6 to #8). In addition, calculating the slope of the signal power time series for the frame at the start position of the suspicious sound occurrence interval requires, in this example, frames beginning four frames before that frame, so these frames are also set. In the example of FIG. 3, frames #4 to #1 are set going back in time from frame #5. The frame at the start position of the suspicious sound occurrence interval (frame #5) is preferably set so that the start position falls approximately at the center of the frame.

For each of the plurality of learning data items with different sounds, the frames are set in the same way, using the rise time of the waveform as the reference.

In this example, four frames are used to calculate the slope of the signal power time series and a total of eight frames are used to calculate the score parameter 111 from the learning data, but the numbers are not limited to this example, and more frames may be used. Likewise, the frame whose center is aligned with the rise time of the waveform of the learning data is the fifth frame here, but a different frame may be used depending on the total number of frames used to calculate the score parameter 111. Furthermore, the learning data may contain no data before the rise time of the waveform. In that case, the frames are set on the assumption that data with the value 0 (silence) is present.
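The frame placement for one learning data item can be sketched as follows. The function and parameter names are hypothetical, `onset` is assumed to be the 0-based sample index of the waveform rise time, and frame #5 is centered on it. Samples before the start of the recording are treated as zeros, as the text describes. Treating samples past the end of the recording as zeros too is an additional assumption made here for convenience.

```python
def aligned_frames(samples, onset, width=160, step=80,
                   onset_frame=5, total_frames=8):
    """Cut total_frames frames so that frame number onset_frame is centered
    on the onset sample. Missing samples are filled with 0.0 (silence).

    Returns a list of frames; index 0 holds frame #1.
    """
    frames = []
    for k in range(1, total_frames + 1):
        # Frame #onset_frame starts half a frame before the onset; the other
        # frames are offset from it by whole frame intervals.
        start = onset - width // 2 + (k - onset_frame) * step
        frame = [samples[n] if 0 <= n < len(samples) else 0.0
                 for n in range(start, start + width)]
        frames.append(frame)
    return frames


# A recording that starts exactly at the onset: frames #1 to #4 are silent,
# frame #5 is half silence and half signal, frame #8 is entirely signal.
frames = aligned_frames([1.0] * 1000, onset=0)
print(sum(frames[0]), sum(frames[4]), sum(frames[7]))  # 0.0 80.0 160.0
```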

Next, the signal power is calculated for each frame set on each learning data item as described above, and the slope of the signal power time series is calculated for each learning data item. Then the expected value μ_k and variance σ²_k of the slope of the signal power time series at the corresponding frame k are calculated across the learning data items.

As an example, taking frame #5, which contains the rising portion of the waveform, as the base point, a histogram of the slope of the signal power time series is created from the learning data for each of frames #5 to #8. FIG. 4 shows an example of the distribution (histogram) of the slope of the signal power time series for each of frames #5 to #8 when acoustic data of glass impact sounds is used as the learning data. FIG. 4(a) is the example for frame #5, FIG. 4(b) for frame #6, FIG. 4(c) for frame #7, and FIG. 4(d) for frame #8. In FIGS. 4(a) to 4(d), the horizontal axis shows the class (bin) of the slope of the signal power time series, and the vertical axis shows the frequency.

Based on the histograms of FIGS. 4(a) to 4(d), the expected value μ and the variance σ² can be obtained for each of frames #5 to #8. The expected value μ_k and the variance σ²_k can be calculated by well-known methods, so their description is omitted here. FIG. 5 shows an example of the expected value μ and the variance σ² of each of frames #5 to #8 calculated from the histograms of FIGS. 4(a) to 4(d). An expected value μ_k and a variance σ²_k are calculated for each frame k. These values are stored in the storage unit 13 as the score parameter 111.
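
The per-frame statistics described above can be sketched as follows. This is only an illustration: the function name `score_parameters`, the input layout, and the use of the population variance are assumptions, since the text leaves the calculation to well-known methods.

```python
from statistics import fmean, pvariance


def score_parameters(slopes_per_sample):
    """Return a list of (mean, variance) pairs, one per aligned frame.

    slopes_per_sample[n][j] is the slope of the signal power time series of
    learning sample n at aligned frame j (hypothetical layout).
    """
    n_frames = len(slopes_per_sample[0])
    params = []
    for j in range(n_frames):
        # Collect the slope of every learning sample at the same aligned frame.
        column = [sample[j] for sample in slopes_per_sample]
        # Population variance is an assumption; the source does not specify it.
        params.append((fmean(column), pvariance(column)))
    return params
```

Each learning sample must first be aligned so that the same index refers to the same frame relative to the rise of the waveform.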

Using the expected value μ_k and the variance σ²_k of each frame k, calculated in this way and stored in the storage unit 13 as the score parameter 111, the score calculation unit 103 obtains a score calculation formula, exemplified by equation (3), that calculates the score S(i) of frame i of the input acoustic data 20. The score S(i) calculated by this formula makes it possible to evaluate frame i of the input acoustic data 20.

In equation (3), the value "8" contained in the slope y_GLP(i+k−8) and the value "8" marking the end of the summation are the total number of frames used to calculate the score parameter 111 from the learning data. The value "5" marking the start of the summation is the number of the frame that contains the start position of the suspicious sound generation section, counted from the first frame used to calculate the score parameter 111 from the learning data. These values are determined according to, for example, the form of the equation used to obtain the slope y_GLP(i) of the signal power time series. In addition, a negative sign is applied to the entire right-hand side of equation (3) so that the maximum value of the score is "0".

In other words, equation (3) calculates the score of frame i from a sum of squared differences between the slope of the signal power time series of the input acoustic data 20 and the expected value, taken while shifting the frame one at a time over four frames from the start position of the suspicious sound generation section in the learning data. The variance normalizes the numerator. Although equation (3) uses the square of the difference between the slope of the signal power time series and the expected value, this is not a limitation; for example, the absolute value of the difference may be used instead.
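
Equation (3) itself does not survive in this text. Read together, the descriptions of its summation limits, the index i+k−8, the normalization by the variance, and the leading negative sign suggest a form like the following. This is a reconstruction under those stated assumptions; any constant factor in the original cannot be recovered from the text.

```latex
S(i) = -\sum_{k=5}^{8} \frac{\bigl(y_{GLP}(i+k-8) - \mu_k\bigr)^{2}}{\sigma_k^{2}}
```

Under this form, S(i) equals 0 only when each of the four slopes matches its expected value exactly, and is negative otherwise.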

The score calculation unit 103 sequentially applies the slope y_GLP(i) of the signal power time series, calculated by the slope calculation unit 102 for each frame i of the input acoustic data 20, to equation (3) and calculates the score S(i) of each frame i.
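
A minimal sketch of this per-frame score calculation is given below. It assumes that equation (3) is the negated sum, over learning frames k = 5 to 8, of the squared difference between y_GLP(i+k−8) and μ_k divided by σ²_k, as the surrounding description indicates. The function name `score` and the dictionary layout of the parameters are hypothetical.

```python
def score(y_glp, i, params, first_k=5, last_k=8):
    """Score S(i) of frame i under the assumed form of equation (3).

    y_glp  -- slopes of the signal power time series, indexed by frame
    params -- dict mapping learning frame number k to (mu_k, var_k)
    """
    if i + first_k - last_k < 0:
        raise ValueError("not enough preceding frames to score frame i")
    total = 0.0
    for k in range(first_k, last_k + 1):
        mu_k, var_k = params[k]
        diff = y_glp[i + k - last_k] - mu_k
        # The variance normalizes the squared difference.
        total += diff * diff / var_k
    # The negative sign makes 0 the largest possible score.
    return -total
```

With this sign convention, a frame whose recent slopes match the learned expectation scores close to 0, and dissimilar frames score strongly negative.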

The generation section detection unit 104 calculates the slope GS(i) of the score time series at frame i from the scores S(i) of the input acoustic data 20 calculated by the score calculation unit 103. In this example, as in the calculation of the slope of the signal power time series described above, the slope GS(i) is calculated, for example by equation (4), from the four scores S(i−4), S(i−3), S(i−1), and S(i), going back four frames from the target frame i to frame (i−4).

Although four scores are used here to calculate the slope GS(i) of the score time series, this is not a limitation. The slope of the score time series may also be calculated by the score calculation unit 103.
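
Equation (4) is not reproduced in this text. One formula that uses exactly the four scores named above is the least-squares slope over the five frames i−4 to i, in which the center frame i−2 has zero weight. The sketch below uses that formula as an assumption; the function name `regression_slope` is hypothetical, and the original equation may differ by a positive constant factor, which would not change the sign tests applied to GS(i).

```python
def regression_slope(values, i):
    """Least-squares slope over frames i-4..i (assumed form of equation (4)).

    Only values[i-4], values[i-3], values[i-1] and values[i] contribute,
    because the center frame i-2 has weight zero in the regression.
    """
    if i < 4:
        raise ValueError("need four preceding frames")
    return (2.0 * (values[i] - values[i - 4])
            + (values[i - 1] - values[i - 3])) / 10.0
```

For a sequence that rises by one unit per frame, this returns 1.0, and for a constant sequence it returns 0.0.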

Based on the calculated slope GS(i) of the score time series, the generation section detection unit 104 determines whether the score S(i) is a local maximum or a local minimum. Specifically, when the slope GS(i) of the score time series satisfies condition (A) below, the score S(i) is a local maximum.
GS(i−1) > 0 and GS(i) ≦ 0 … (A)

Similarly, when the slope GS(i) of the score time series satisfies condition (B) below, the score S(i) is a local minimum.
GS(i−1) < 0 and GS(i) ≧ 0 … (B)
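
Conditions (A) and (B) are sign-change tests on two consecutive slope values, and can be sketched as follows. The function name `classify_extremum` and its string return values are hypothetical.

```python
def classify_extremum(gs_prev, gs_curr):
    """Classify frame i from GS(i-1) and GS(i).

    Returns "max" when condition (A) holds, "min" when condition (B) holds,
    and None otherwise.
    """
    if gs_prev > 0 and gs_curr <= 0:   # condition (A): local maximum
        return "max"
    if gs_prev < 0 and gs_curr >= 0:   # condition (B): local minimum
        return "min"
    return None
```

Note that a slope that stays at exactly zero for two consecutive frames satisfies neither condition.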

When the score S(i) is a local maximum, the generation section detection unit 104 determines whether the score S(i) exceeds a threshold and, if it does, determines that the frame i corresponding to that score contains the rising position of the waveform. The threshold is obtained in advance, for example experimentally, and stored in the storage unit 13. If the learning data has been set so that the rising position of the waveform falls at approximately the center of the corresponding frame (for example, frame #5), the approximate center of frame i is taken as the rising position of the waveform.

For the first score S(i) that takes a local minimum after a frame containing the rising position of the waveform has been detected, the generation section detection unit 104 determines that the frame i corresponding to that score contains the falling position of the waveform. In this case as well, if the learning data has been set so that the rising position of the waveform falls at approximately the center of the corresponding frame (for example, frame #5), the approximate center of frame i is taken as the falling position of the waveform.

When the rising and falling positions of the waveform have been detected, the generation section detection unit 104 takes the detected rising position as the start position of the suspicious sound generation section and the falling position as its end position. The suspicious sound generation section is thereby detected.
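
The detection rule described above (a local maximum that exceeds the threshold marks the start, and the first local minimum after it marks the end) can be sketched as follows. The function name `detect_sections`, the use of None for frames without a slope, and the choice to keep the most recent qualifying maximum are assumptions made for illustration.

```python
def detect_sections(scores, slopes, threshold):
    """Return (start_frame, end_frame) pairs of detected sections.

    scores[i] is S(i) and slopes[i] is GS(i); slopes[i] may be None for
    frames where the slope is not yet defined.
    """
    sections = []
    start = None
    for i in range(1, len(scores)):
        prev, curr = slopes[i - 1], slopes[i]
        if prev is None or curr is None:
            continue
        if prev > 0 and curr <= 0:          # condition (A): local maximum
            if scores[i] > threshold:
                start = i                   # frame containing the rise
        elif prev < 0 and curr >= 0:        # condition (B): local minimum
            if start is not None:
                sections.append((start, i))  # first minimum after the rise
                start = None
    return sections
```

The frame numbers returned here still include the delay introduced by the score calculation and would need to be corrected before cutting out the audio.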

The meaning of the score calculation formula of equation (3) is now explained. The slope y_GLP(i) of the signal power time series used in equation (3) is calculated by equation (2) above, and the signal power y_LP(i) used in equation (2) is calculated by equation (1) above. FIG. 6 shows an example of the signal power y_LP(i), the slope y_GLP(i) of the signal power time series, and the score S(i) obtained in this way for each frame i of the input acoustic data 20.

The expected value μ and the variance σ² of the feature value based on the learning data are calculated for the frame containing the rise of the waveform of the sound (suspicious sound) in the learning data and for each of a predetermined number of frames starting from that frame (four frames in the above example). FIG. 7 shows an example of the expected value μ_k and the variance σ²_k obtained for each frame k of the learning data.

FIG. 8 is a graph in which the signal power, the slope of the signal power time series, and the score shown in FIG. 6 are plotted against the frame number. The delay associated with score calculation, described later, is not corrected in this graph. In the example of FIG. 8, the score plot lags the signal power plot by three frames.

At the position where the sound in the learning data and the sound contained in the input acoustic data 20 are highly similar, that is, at the rising position of the sound, the score based on the difference between the expected value μ and the feature value of the input acoustic data 20 takes its maximum value (the position of frame #9 in the score plot). Referring to the Σ term of equation (3), the score therefore takes a local maximum at the point where the sum over the predetermined number of frames, with its negative sign applied, is largest, and that frame is taken as the frame containing the start position of the sound generation section.

After the start position of the sound generation section, the signal power of the input acoustic data 20 decays (frames #7 to #10 in the signal power plot). Accordingly, the slope of the signal power time series of the input acoustic data 20 takes negative values (frames #8 and #9 in the plot of the slope of the signal power time series). The square of "y_GLP(i+k−8)−μ_k" in equation (3) therefore becomes large, and the score S(i), which is the sum of these squared values over four frames with a negative sign applied, becomes small (frames #12 and #13 in the score plot). When the score S(i) is smallest, it takes a local minimum (frame #13 in the score plot), and the frame at which this local minimum occurs can be regarded as the frame containing the end position of the sound generation section.

FIGS. 9 and 10 show examples of suspicious sound generation sections detected as described above. FIG. 9 is an example in which the observed sound contains little noise (environmental sound), and FIG. 10 is an example in which it contains much noise. In both figures, the suspicious sound generation section is detected using the same score parameter 111 and the same detection threshold.

In each of FIGS. 9 and 10, the upper graph shows the input acoustic data 20, and the lower graph shows the score for the input acoustic data 20 together with the suspicious sound generation section detected from the score. A High level of the section trace indicates a suspicious sound generation section. In the score graphs of FIGS. 9 and 10, the delay associated with score calculation, described later, has been corrected.

In FIG. 9, the upper graph of the input acoustic data 20 shows that a suspicious sound occurs near time "3000" and decays rapidly over a duration of about "200". In the lower graph, in accordance with condition (A) above, the score takes a large local maximum near time "3000" and a somewhat large local maximum near time "5500". In this example, the local maximum near time "3000" exceeds the threshold, and the local maximum near time "5500" does not. Furthermore, in accordance with condition (B) above, the score takes a local minimum near time "3300", after the local maximum near time "3000" that exceeds the threshold. The interval from around time "3000" to around time "3300" can therefore be determined to be the suspicious sound generation section.

It can also be seen that the noisy observed sound of FIG. 10 yields the same result as the low-noise observed sound of FIG. 9. This shows that the sound detection device of the present embodiment can easily detect the suspicious sound generation section even in a noisy environment.

As already described, the score calculation uses frames before and after the target frame of the input acoustic data 20. Consequently, as shown in FIG. 11, the rising and falling positions of the waveform of the suspicious sound obtained from the calculated score are delayed relative to the actual rising and falling positions of the waveform in the input acoustic data 20. To cut the suspicious sound generation section out of the input acoustic data 20, this delay must be corrected.

The delay correction amount depends on the sampling frequency of the input acoustic data 20, the frame width W, and the frame interval D. In the example above, where the eight frames #1 to #8 are used for score calculation and frame #5 corresponds to the rising position of the waveform in the learning data, the situation is as illustrated in FIG. 12. Calculating the signal power of the input acoustic data 20 takes a time of one frame width W, and the signal power of each frame is calculated at every frame interval D. Calculating the slope of the signal power time series requires waiting until five frames later, that is, one frame width W plus four frame intervals D. In addition, because four frames are used to calculate the score, four frame intervals D are required. Calculating the score of frame i therefore takes one frame width W plus seven frame intervals D, which equals nine frame intervals D (with W = 2D, as in the numerical example that follows).

As a more specific example, when the sampling frequency of the input acoustic data 20 is 16 kHz, one frame width W is 160 samples, and one frame interval D is 80 samples, the delay correction amount is 80 samples × 9 = 720 samples. Converted to time, this is 720 samples × (1/16000) = 0.045 sec (45 milliseconds).

The generation section detection unit 104 subtracts this delay correction amount from the times of the start and end positions of the detected suspicious sound generation section and takes the results as the times of the start and end positions of the corrected suspicious sound generation section. It then cuts the data of the corrected section out of the input acoustic data 20 stored in the storage unit 13 and outputs it as output acoustic data 21.
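
The delay arithmetic of this example can be checked with a short sketch. The function names and the default of nine frame intervals are taken from the example in the text or introduced for illustration; they are not part of the described device.

```python
def delay_correction(frame_interval_samples, sampling_rate_hz,
                     delay_frame_intervals=9):
    """Return the delay correction as (samples, seconds).

    The default of nine frame intervals follows the example in the text,
    where the frame width W is twice the frame interval D.
    """
    samples = frame_interval_samples * delay_frame_intervals
    return samples, samples / sampling_rate_hz


def correct_position(position_samples, correction_samples):
    """Shift a detected position earlier by the delay correction amount."""
    return position_samples - correction_samples
```

With a frame interval of 80 samples at 16 kHz, `delay_correction(80, 16000)` gives 720 samples, or 45 milliseconds, matching the figures in the example.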

FIG. 13 is a flowchart showing an example of the suspicious sound generation section detection process according to the present embodiment. Each process in this flowchart is executed, for example, by a CPU (not shown) included in the calculation unit 12 in accordance with a program stored in advance in, for example, the storage unit 13. The program includes, for example, modules that implement the signal power calculation unit 101, the slope calculation unit 102, the score calculation unit 103, and the generation section detection unit 104; when the program is executed by the CPU, these modules are loaded into a main memory (not shown) and run.

Alternatively, the signal power calculation unit 101, the slope calculation unit 102, the score calculation unit 103, and the generation section detection unit 104 included in the calculation unit 12 may each be implemented as separate hardware, and the units may cooperate to execute each process in the flowchart.

In FIG. 13, in step S100, the microphone 10 outputs an analog audio signal corresponding to the collected observed sound. The A/D conversion unit 11 converts this analog audio signal into a digital audio signal, which is supplied to the signal power calculation unit 101 as the input acoustic data 20. The input acoustic data 20 is also supplied to and stored in the storage unit 110.

The signal power calculation unit 101 sets a frame i on the supplied input acoustic data 20 and calculates the signal power y_LP(i) of that frame in accordance with equation (1) above (step S101). The calculated value of the signal power y_LP(i) is held temporarily in, for example, the storage unit 110. In the next step S102, the slope calculation unit 102 retrieves a predetermined number of previously calculated signal power values from the storage unit 110 and calculates the slope y_GLP(i) of the signal power time series in accordance with equation (2) above. The calculated slope y_GLP(i) is held in the storage unit 110.

Next, in step S103, the score calculation unit 103 retrieves from the storage unit 110 the previously calculated values of the slope y_GLP(i) of the signal power time series and the score parameter 111 calculated in advance from the learning data, and calculates the score S(i) of frame i in accordance with equation (3) above. The calculated score S(i) is held in the storage unit 110. In the next step S104, the generation section detection unit 104 retrieves the previously calculated score values from the storage unit 110 and calculates the slope GS(i) of the score time series in accordance with equation (4) above.

In the next step S105, the generation section detection unit 104 refers to condition (A) above and determines, from the calculated slope GS(i) of the score time series, whether the score S(i) is a local maximum. If it determines that the score S(i) is a local maximum, the process proceeds to step S106, where it determines whether that score exceeds a predetermined threshold. If it determines that the threshold is not exceeded, the process returns to step S100.

On the other hand, if it is determined in step S106 that the score S(i) exceeds the threshold, the process proceeds to step S107, where the rise of the waveform of the suspicious sound is regarded as having been detected at approximately the center of frame i. The process then returns to step S100.

If it is determined in step S105 above that the score S(i) is not a local maximum, the process proceeds to step S108. In step S108, the generation section detection unit 104 refers to condition (B) above and determines whether the score S(i) is a local minimum. If it determines that the score is not a local minimum, the process returns to step S100.

On the other hand, if it is determined in step S108 that the score S(i) is a local minimum, the process proceeds to step S109. In step S109, the generation section detection unit 104 determines whether this local minimum is the first local minimum detected since the local maximum was detected in step S106 above. If it determines that this is not the first local minimum detected after the local maximum, the process returns to step S100.

If the generation section detection unit 104 determines in step S109 that the score S(i) is the first local minimum detected since the local maximum was detected in step S106, the process proceeds to step S110, where the fall of the waveform of the suspicious sound is regarded as having been detected at approximately the center of frame i. With step S110 and step S107 described above, both the rise and the fall of the waveform of the suspicious sound have been detected.

In the next step S111, the generation section detection unit 104 corrects the rising and falling positions of the waveform of the suspicious sound using the delay correction amount, which is calculated from the sampling frequency of the input acoustic data 20, the frame width W, the frame interval D, and the number of frames used to calculate the slope of the signal power time series for the learning data. The corrected rising and falling positions are taken as the start and end positions of the suspicious sound generation section, and the section is thereby detected (step S112).

As described above, according to the present embodiment, time-series data of a feature value is extracted from input acoustic data based on the collected observed sound, a score is obtained by comparing the extracted feature value with the expected value of the feature value calculated in advance from learning data, and the suspicious sound generation section is detected from the change of this score over time. The suspicious sound generation section can therefore be detected easily even in a noisy environment.

In the present embodiment, the threshold for detecting a suspicious sound is applied to the score calculated from the feature value of the input acoustic data. This score takes a substantially constant value even for observed sound in a noisy environment, so the threshold does not need to be changed according to the environment of the monitoring area. Moreover, because the present embodiment uses the score, a value that is not easily affected by noise, to detect suspicious sounds, detection is robust against noise, and the suspicious sound generation section can be detected stably even in a noisy environment or in an environment where the noise level changes.

The following explains why the method of detecting the suspicious sound generation section according to the present embodiment is robust in noisy environments.

In the present embodiment, the score calculated by equation (3) tends to take a large value when the similarity between the learning data and the input acoustic data 20 is high. Noise (acoustic data of noise), by contrast, has low similarity to the learning data, so the score stays at a substantially constant value and does not change greatly. Therefore, by setting a threshold on the score and determining whether the score exceeds it, the generation section of a sound can be detected more reliably than with conventional sound detection methods that, for example, set a threshold on the signal power.

That is, when acoustic data of a suspicious sound is input, the waveform of the suspicious sound generation section in the input acoustic data is highly similar to the waveform of the learning data, so the score changes greatly at the beginning of the suspicious sound generation section. In the present embodiment, this score is compared with the threshold, and when the score exceeds the threshold, it is determined that a suspicious sound generation section has been detected.

On the other hand, when acoustic data of noise is input, the waveform of the noise has low similarity to the waveform of the learning data, so the change in the score is extremely small. The score is therefore unlikely to exceed the threshold, and false detections are suppressed.

FIG. 14 shows an example of the signal power, the slope of the signal power time series, and the score when substantially random acoustic data (acoustic data of noise) is input, in the case where the learning data is acoustic data of a collected sound (the suspicious sound to be detected). FIG. 15 is a graph in which the values of the items in FIG. 14 are plotted against the frame number. The expected value μ_k and the variance σ² of the learning data are the same as the values shown in FIG. 7 above.

The graph of FIG. 15 has not been corrected for the delay associated with score calculation described above, and the score plot lags the signal power plot by three frames.

As illustrated in FIG. 15, the signal power of acoustic data of noise varies relatively widely. If a threshold 200 is set on the signal power as in the conventional approach, frames #3, #16, #18, #19, and so on are falsely detected. By contrast, when acoustic data whose waveform differs greatly from the learning data is input, the score does not change greatly. When a threshold is set on the score as in the present embodiment, the score is therefore unlikely to exceed the threshold, and the generation section of a specific sound can be detected accurately even in a noisy environment.

The present embodiment also detects the suspicious sound generation section by extracting the features of the suspicious sound. Not only voices but also various other sounds can therefore be made targets of suspicious sound generation section detection.

Furthermore, applying the present embodiment makes it possible to detect the generation section of a suspicious sound accurately. This can be expected to improve the accuracy of sound recognition processing performed on the acoustic data of the detected suspicious sound generation section. In addition, by detecting the suspicious sound generation section before sound recognition processing, the system needs to perform recognition only on the acoustic data of the detected section, which reduces the computational cost of the sound recognition system as a whole and the communication cost of transmitting acoustic data.

The sound detection device of the present embodiment can be incorporated into a security device that outputs an alarm when a suspicious person is detected in the monitoring area, or the output of the sound detection device can be fed into such a security device. Because suspicious sounds in the monitoring area can then be detected easily and with high accuracy, false alarms by the security device can be prevented.

DESCRIPTION OF SYMBOLS
10 Microphone
11 A/D conversion part
12 Calculation part
13 Storage part
20 Input acoustic data
21 Output acoustic data
100 Sound detection device
101 Signal power calculation part
102 Inclination calculation part
103 Score calculation part
104 Generation section detection part
111 Score parameter

Claims

A sound detection device comprising: feature value calculating means for calculating, along a time series, a feature value indicating a feature of sound data;
score calculating means for calculating a score for evaluating the sound data based on the difference between an expected value of the feature value, obtained in advance along the time series from learning data, and the feature value of the signal power time series of the sound data calculated by the feature value calculating means; and
detecting means for detecting the position of a maximum value and the position of a minimum value of the score, and for detecting a specific sound generation section in the sound data based on the position of the maximum value and the position of the minimum value.

The sound detection device according to claim 1, wherein the detecting means:
determines, when the maximum value exceeds a threshold, the position of the maximum value as the start position of the specific sound generation section; and
determines the position of the minimum value that first appears after the start position as the end position of the specific sound generation section.

The sound detection device according to claim 1, wherein the expected value of the feature value is obtained by using, as the learning data, a plurality of pieces of sound data that are of the same type of sound but differ from one another.

The sound detection device according to any one of claims 1 to 3, wherein
the expected value of the feature value of the signal power time series is obtained for each of a plurality of predetermined ranges of the learning data that are arranged in time-series order and partially overlap,
the feature value calculating means calculates the feature value for each of a plurality of predetermined ranges of the sound data that are arranged in time-series order and partially overlap, and
the score calculating means calculates the score based on a sum of the differences between the feature values of the plurality of predetermined ranges of the sound data and the expected values of the plurality of predetermined ranges obtained for the learning data.

The sound detection device according to claim 4, wherein the score calculating means calculates the score by normalizing the difference using a variance value of the feature values of the signal power time series obtained in advance from the learning data.

The sound detection device according to claim 1, wherein the feature value is a slope of the signal power time series of the sound data.

The sound detection device according to any one of claims 1 to 6, further comprising delay correction means for correcting, for the specific sound generation section detected by the detecting means, a delay that arises when the score calculating means calculates the score.

A sound detection method comprising: a feature value calculating step in which feature value calculating means calculates, along a time series, a feature value indicating a feature of sound data;
a score calculating step in which score calculating means calculates a score for evaluating the sound data based on the difference between an expected value of the feature value, obtained in advance along the time series from learning data, and the feature value of the signal power time series of the sound data calculated in the feature value calculating step; and
a detecting step in which detecting means detects the position of a maximum value and the position of a minimum value of the score, and detects a specific sound generation section in the sound data based on the position of the maximum value and the position of the minimum value.