JP6257537B2

JP6257537B2 - Saliency estimation method, saliency estimation device, and program

Info

Publication number: JP6257537B2
Application number: JP2015007718A
Authority: JP
Inventors: 惇米家; 茂人古川; 牧夫柏野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-01-19
Filing date: 2015-01-19
Publication date: 2018-01-10
Anticipated expiration: 2035-01-19
Also published as: JP2016133600A

Description

The present invention relates to a saliency estimation method, a saliency estimation device, and a program for estimating how conspicuous the sound is in a partial time interval of an input signal.

Conventionally, a model (the auditory saliency map) has been proposed that calculates the degree of conspicuousness in the time-frequency domain based on the spectral structure of sound (see Non-Patent Document 1).

Non-Patent Document 1: C. Kayser, C. I. Petkov, M. Lippert, N. K. Logothetis, "Mechanisms for Allocating Auditory Attention: An Auditory Saliency Map", Current Biology, 2005, Volume 15, Issue 21, Pages 1943-1947.

With the auditory saliency map, the conspicuousness of a sound at a specific time and frequency can be estimated by calculating, for the spectrogram of the sound, its intensity and the contrast of that intensity in time and frequency.

Because this model performs its calculation on the spectral structure of the sound, it assigns the same degree of saliency to the same sound regardless of the context in which that sound is presented. Consequently, it cannot adequately express saliency that depends on time-series patterns or context, such as the increase in saliency when the same sound is presented at an unexpected timing.

An object of the present invention is therefore to provide a saliency estimation method that can estimate the degree of conspicuousness of a target sound based on the unpredictability of a time-series pattern.

The present invention is a saliency estimation method for estimating the degree of conspicuousness of sound in a partial time interval of an input signal, and includes a similar partial interval detection step, a prediction distribution generation step, and a saliency estimation step. Among the time intervals of the input signal, the prediction interval is the time interval for which the saliency, that is, the degree of conspicuousness of the sound, is to be estimated; the reference interval is a time interval of predetermined width immediately before the prediction interval; and the accumulation interval is the time interval before the reference interval. The estimation target signal is the feature of the input signal corresponding to the prediction interval, and the reference signal is the feature of the input signal corresponding to the reference interval.

The similar partial interval detection step detects, within the accumulation interval, one or more similar partial intervals, which are partial intervals similar to the reference signal. The prediction distribution generation step generates one or more prediction distributions, each of which is a distribution based on the feature of the input signal corresponding to a predetermined time interval immediately after a detected similar partial interval. The saliency estimation step estimates the saliency of the input signal corresponding to the prediction interval based on the generated prediction distributions and the estimation target signal.

The saliency estimation method of the present invention can estimate the degree of conspicuousness of a target sound based on the unpredictability of a time-series pattern.

FIG. 1 is a block diagram showing the configuration of the saliency estimation device of Embodiment 1.
FIG. 2 is a flowchart showing the operation of the saliency estimation device of Embodiment 1.
FIG. 3 is a diagram explaining the definitions of the accumulation interval, reference interval, prediction interval, accumulation partial interval, and similar partial interval.
FIG. 4 is a diagram showing an example in which an accumulation interval, a reference interval, and a prediction interval were set for actual music and similar partial intervals were obtained.
FIG. 5 is a diagram showing an example in which saliency was obtained for each of two waveforms based on actual music.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

The saliency estimation device of Embodiment 1, which estimates the degree of conspicuousness of sound based on the unpredictability of the time-series pattern of an acoustic signal, is described below with reference to FIGS. 1 and 2. The degree of conspicuousness of sound is also called saliency. FIG. 1 is a block diagram showing the configuration of the saliency estimation device 1 of this embodiment. FIG. 2 is a flowchart showing the operation of the saliency estimation device 1 of this embodiment. The saliency estimation device 1 evaluates the saliency, in a specific time interval, of an input signal that is a specific acoustic signal; concretely, it is implemented on a general-purpose computer having a CPU (Central Processing Unit) and memory. As shown in FIG. 1, the saliency estimation device 1 includes a feature calculation unit 11, an interval similarity calculation unit 12, a similar partial interval detection unit 13, a prediction distribution generation unit 14, and a saliency estimation unit 15. The operation of each component is described in detail below.

<Feature calculation unit 11 (step S11)>
The feature calculation unit 11 performs time-frequency analysis over the entire input signal and calculates and outputs a feature vector for each frame (S11).

For example, the feature calculation unit 11 converts the entire input signal into a power spectrogram based on auditory filters. The auditory filters are a bank of band-pass filters whose center frequencies vary continuously. The feature calculation unit 11 performs frequency analysis of the signal, treating each filter as a band-pass filter (here, a gammatone filter) with a different bandwidth.

When the total number of auditory filters is C, the center frequency f_k of the k-th filter (k = 1, 2, ..., C) is expressed, with f_max the maximum frequency of the target frequency band (usually the Nyquist frequency) and f_min the minimum frequency, as

[equation not reproduced]

where Q_e is the asymptotic value of the Q factor (= center frequency / bandwidth) in the high-frequency band and w_0 is the lower limit of the bandwidth in the low-frequency band; both are constants. The bandwidth ERB_k of each filter is then expressed as

[equation not reproduced]

The resulting filter output has the same sampling frequency as the signal. For dimensionality reduction and noise reduction, it is converted to frame units by taking a moving average with a frame width ΔT (on the order of several ms) at intervals of about ΔT/2. In the following, the output of the i-th auditory filter (i = 1, 2, ..., C) at frame number t is written p_i[t], and the output of the whole filter bank at that frame (a C-dimensional vector) is written

p[t] = (p_1[t], p_2[t], ..., p_C[t])

Hereinafter, p[t] is called the feature vector at frame number t. The feature calculation unit 11 may instead calculate the feature vector by a technique such as NMF (non-negative matrix factorization), or from a simple spectrogram.
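The filter-bank layout and the frame averaging described above can be sketched as follows. The two equations are not legible in this text, so the sketch assumes the widely used ERB spacing of Glasberg and Moore (as in Slaney's gammatone filter bank), with the usual defaults Q_e = 9.26449 and w_0 = 24.7 Hz; the function names and the spacing formula are assumptions for illustration, not the patent's stated formulas.

```python
import math

def erb_center_frequencies(num_filters, f_min, f_max, q_e=9.26449, w_0=24.7):
    """Center frequencies f_k for k = 1..C, ordered from high to low.

    Assumed ERB-rate spacing; q_e and w_0 play the roles of Q_e and w_0.
    """
    offset = q_e * w_0
    log_ratio = math.log((f_min + offset) / (f_max + offset))
    return [
        -offset + (f_max + offset) * math.exp(k * log_ratio / num_filters)
        for k in range(1, num_filters + 1)
    ]

def erb_bandwidth(f_k, q_e=9.26449, w_0=24.7):
    """Assumed bandwidth: tends to f_k / q_e at high f_k and to w_0 at low f_k."""
    return f_k / q_e + w_0

def frame_average(power, frame_len, hop):
    """Moving average of one filter's power output over frames of frame_len
    samples, advancing by hop samples (e.g. hop = frame_len // 2)."""
    return [
        sum(power[start:start + frame_len]) / frame_len
        for start in range(0, len(power) - frame_len + 1, hop)
    ]
```

Applying `frame_average` to each of the C filter outputs and stacking the results per frame gives one C-dimensional feature vector p[t] per frame.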

<Interval similarity calculation unit 12 (step S12)>
The intervals of the input signal are defined below with reference to FIG. 3. FIG. 3 is a diagram explaining the definitions of the accumulation interval, reference interval, prediction interval, accumulation partial interval, and similar partial interval. As shown in FIG. 3, the prediction interval is the time interval whose saliency is to be estimated; the reference interval is a time interval of predetermined width immediately before the prediction interval; the accumulation interval is the entire time interval before the reference interval; and an accumulation partial interval is a part of the accumulation interval that has the same time width as the reference interval. Likewise, the estimation target signal is the feature (all features) of the input signal corresponding to the prediction interval; the reference signal is the feature of the input signal corresponding to the reference interval; the accumulation signal is the feature of the input signal corresponding to the accumulation interval; and an accumulation partial signal is the feature of the input signal corresponding to an accumulation partial interval.

Based on the feature vectors calculated in step S11, the interval similarity calculation unit 12 calculates the interval similarity, which is the similarity between an accumulation partial signal of the accumulation signal and the reference signal (S12).

In the following, as shown in FIG. 3, the frame numbers of the accumulation interval run from 1 to T, those of the reference interval from T+1 to T+N, and those of the prediction interval from T+N+1 to T+N+M. Because the accumulation interval must be longer than the reference interval, T > N.
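The frame layout above can be made concrete with a small helper that slices a list of per-frame feature vectors into the three intervals. This is only an illustrative sketch: the function name `split_intervals` is hypothetical, and the 1-based frame numbers of the text are mapped to 0-based Python indices.

```python
def split_intervals(features, T, N, M):
    """Split per-frame features into accumulation, reference and prediction parts.

    Frames 1..T, T+1..T+N and T+N+1..T+N+M of the text correspond to the
    0-based slices [0:T], [T:T+N] and [T+N:T+N+M].
    """
    if T <= N:
        raise ValueError("the accumulation interval must be longer than the reference interval (T > N)")
    if len(features) < T + N + M:
        raise ValueError("not enough frames for the requested intervals")
    accumulation = features[:T]
    reference = features[T:T + N]
    prediction = features[T + N:T + N + M]
    return accumulation, reference, prediction
```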

The interval similarity calculation unit 12 calculates the angle between the power spectral densities for every pair of frames of the feature vectors and defines it as the similarity between frames. The similarity S(t_1, t_2) between the feature vectors of frame numbers t_1 and t_2 is calculated as

[equation not reproduced]

To extract partial intervals of the accumulation interval that resemble the reference interval, the interval similarity calculation unit 12 then uses the frame-level similarities calculated above to compute a similarity over units of the same length as the reference interval (N frames). The interval similarity SIM(t) between the reference interval and the accumulation partial interval (frame numbers t to t+N−1) is calculated as

[equation not reproduced]

SIM(t) is calculated over 1 ≦ t ≦ T. The interval similarity may also be calculated, for example, by the method described in Japanese Patent No. 4327202.
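A minimal sketch of this two-stage similarity follows. The exact expressions for S(t_1, t_2) and SIM(t) are not legible in this text, so the sketch makes two labeled assumptions: the frame similarity is taken to be the cosine of the angle between two feature vectors, and SIM(t) is taken to be the mean of the N frame similarities along the aligned frames. It also evaluates only windows that lie entirely inside the accumulation interval, which is narrower than the range 1 ≦ t ≦ T given in the text.

```python
import math

def frame_similarity(u, v):
    """Assumed frame similarity: cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def interval_similarity(features, T, N):
    """Assumed SIM for every window start (0-based) inside the accumulation interval.

    features[T:T+N] is the reference interval; each window features[s:s+N]
    is compared frame by frame and the N frame similarities are averaged.
    """
    reference = features[T:T + N]
    sims = []
    for start in range(T - N + 1):
        window = features[start:start + N]
        total = sum(frame_similarity(a, b) for a, b in zip(window, reference))
        sims.append(total / N)
    return sims
```

With non-negative power features the cosine lies between 0 and 1, which matches the later use of the similarity as a value between 0 and 1.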

<Similar partial interval detection unit 13 (step S13)>
Based on the interval similarity output by the interval similarity calculation unit 12, the similar partial interval detection unit 13 detects, within the accumulation interval, one or more similar partial intervals, which are partial intervals similar to the reference signal (S13). The similar partial interval detection unit 13 extracts, over 1 ≦ t ≦ T, the D frame numbers with the highest interval similarity SIM(t), in descending order. D is a constant; extracting several to several tens is desirable.

In practice, highly similar partial intervals may overlap and cluster in a particular time range. A threshold can therefore be set so that frame numbers adjacent within the threshold are excluded and a single partial interval is extracted as their representative. The D partial intervals detected in this way are called similar partial intervals. Similar partial intervals may also be detected, for example, by the method described in Japanese Patent No. 4327202.
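One way to realize the top-D extraction with suppression of neighboring candidates is a greedy pass over the candidates in descending order of similarity. This is a sketch under assumptions: the function name and the `min_gap` parameter (the threshold, expressed as a minimum distance in frames between accepted start frames) are hypothetical, and the patent does not prescribe this particular procedure.

```python
def detect_similar_intervals(sims, D, min_gap):
    """Return up to D window starts with the highest similarity.

    Candidates are visited in descending order of similarity; a candidate
    closer than min_gap frames to an already accepted start is skipped, so
    each cluster of overlapping windows contributes one representative.
    """
    order = sorted(range(len(sims)), key=lambda t: sims[t], reverse=True)
    picked = []
    for t in order:
        if all(abs(t - p) >= min_gap for p in picked):
            picked.append(t)
            if len(picked) == D:
                break
    return picked
```

Setting `min_gap` to the reference length N would keep the accepted windows from overlapping at all; a smaller value permits partial overlap.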

FIG. 4 shows an example in which an accumulation interval, a reference interval, and a prediction interval were set for actual music and similar partial intervals were obtained. In the example of FIG. 4, the partial intervals detected as similar to the reference signal of the reference interval set at around 5 to 6 seconds are those at around 1 to 2 seconds (similar partial interval 1) and around 3 to 4 seconds (similar partial interval 2) in the accumulation interval.

<Prediction distribution generation unit 14 (step S14)>
The prediction distribution generation unit 14 generates a prediction distribution, which is a distribution that predicts the feature vector of the prediction interval, based on the feature of the input signal corresponding to a predetermined time interval immediately after each detected similar partial interval (S14).

Specifically, for each frame of the prediction interval, the prediction distribution generation unit 14 generates a prediction distribution for each element of the C-dimensional feature vector. The prediction distributions are output as D distributions, one calculated from each of the D similar partial intervals detected by the similar partial interval detection unit 13.

For frame number T+N+t of the prediction interval, the prediction distribution generated from the similar partial interval with start frame number L_d is a multidimensional normal distribution N(μ_d, Σ_d) whose mean μ_d and variance-covariance matrix Σ_d are given by

[equation not reproduced]

where v_d is the variance defined for each auditory filter,

[equation not reproduced]

Here σ^2 is a C-dimensional vector of the variances of each element over the accumulation interval and is constant regardless of the similarity. The second term is a coefficient that depends on the similarity. As the similarity approaches 1, the variance approaches 0 and the prediction concentrates around the mean with high probability. When the similarity is 0, the variance (at t = 0) equals the variance σ^2 over the accumulation interval. This rests on the fact that, when the prediction does not apply at all, the expected (absolute) difference between the prediction distribution and the observed value equals the expected (absolute) difference between any two points of the past time series, that is, the standard deviation of the time series. The third term represents temporal decay: the variance increases exponentially with time, so the prediction distribution approaches a uniform distribution. γ is a constant representing the degree of decay and can be set to any positive value.

As the above equations show, the mean of the prediction distribution coincides with the behavior of the input signal in the M frames immediately after the similar partial interval. The variance of the prediction distribution varies with the similarity SIM(d) between each similar partial interval and the reference interval. In addition to the above definition of the variance, a further factor that increases the variance of the prediction distribution with the time difference between the similar partial interval and the reference interval may be multiplied in, so as to account for "forgetting" in prediction.
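The construction of one prediction distribution can be sketched as below. The exact equations are not legible in this text, so the variance form is an assumption chosen only to satisfy the properties stated above: v_d = σ^2 · (1 − SIM) · exp(γ t), which equals σ^2 when the similarity is 0 and t = 0, goes to 0 as the similarity approaches 1, and grows exponentially with t. The covariance is assumed diagonal, and the function names, the default γ and the 0-based indexing are hypothetical.

```python
import math

def element_variances(accumulation):
    """Per-element variance (sigma^2) of the feature vectors in the accumulation interval."""
    n = len(accumulation)
    dims = len(accumulation[0])
    means = [sum(f[i] for f in accumulation) / n for i in range(dims)]
    return [sum((f[i] - means[i]) ** 2 for f in accumulation) / n for i in range(dims)]

def prediction_distribution(features, start, sim, N, t, sigma2, gamma=0.1):
    """Mean and diagonal variance predicted for the t-th prediction frame (t = 1..M).

    start is the 0-based first frame of a similar partial interval of length N.
    The mean is the feature vector observed t frames after that interval began
    to be followed, i.e. features[start + N + t - 1].
    """
    mean = list(features[start + N + t - 1])
    scale = (1.0 - sim) * math.exp(gamma * t)  # assumed similarity and time-decay terms
    variance = [s * scale for s in sigma2]
    return mean, variance
```

Calling `prediction_distribution` once per detected similar partial interval yields the D distributions for a given prediction frame.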

<Saliency estimation unit 15 (step S15)>
The saliency estimation unit 15 estimates the degree of conspicuousness (saliency) of the acoustic signal (input signal) corresponding to the prediction interval by comparing the prediction distributions generated by the prediction distribution generation unit 14 with the actual estimation target signal (S15).

Specifically, the saliency estimation unit 15 calculates a saliency element for each of the D prediction distributions output by the prediction distribution generation unit 14, based on the probability that the estimation target signal occurs under that distribution. It then outputs the saliency as the weighted average of the saliency elements, weighted by the similarity between each similar partial interval and the reference interval.

For the estimation target signal p[T+N+t], the saliency element for the d-th prediction distribution (based on the similar partial interval with start frame number L_d) is calculated as

[equation not reproduced]

The saliency z[T+N+t] is defined as the weighted average of these values, weighted by the similarity between the d-th similar partial interval and the reference interval:

[equation not reproduced]

where A is a normalization coefficient,

[equation not reproduced]
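The saliency computation can be sketched as follows. The text only says that each saliency element is based on the probability of the observed signal under a prediction distribution, so the element is assumed here to be the negative log-likelihood under a diagonal normal distribution, and the normalization coefficient A is assumed to be the sum of the similarities used as weights. Function names and the variance floor are hypothetical.

```python
import math

def saliency_element(observed, mean, variance, floor=1e-12):
    """Assumed saliency element: negative log-likelihood of the observed
    feature vector under a normal distribution with diagonal covariance."""
    total = 0.0
    for x, m, v in zip(observed, mean, variance):
        v = max(v, floor)  # avoid division by zero when the similarity is 1
        total += 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def saliency(observed, distributions, sims):
    """Similarity-weighted average of the D saliency elements.

    distributions is a list of (mean, variance) pairs, and sims holds the
    similarity of each corresponding similar partial interval.
    """
    A = sum(sims)  # assumed normalization coefficient
    if A <= 0:
        raise ValueError("the similarities must sum to a positive value")
    weighted = sum(
        s * saliency_element(observed, mean, variance)
        for s, (mean, variance) in zip(sims, distributions)
    )
    return weighted / A
```

Under this assumption an observation far from every predicted mean, relative to the predicted variance, produces a large saliency, while an observation that matches a confident prediction produces a small one.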

So far the start frame number of the prediction interval has been fixed at T+N+1. Alternatively, the saliency may be calculated while the start frame number of the prediction interval is varied over a whole range of the input signal, and the accumulated value of those saliencies may be defined as the saliency.

FIG. 5 shows an example in which saliency was obtained for actual music. A and B of FIG. 5 show the same musical tone signal heard for the first time and for the second time, respectively, with signal intensity on the vertical axis and time on the horizontal axis. C and D of FIG. 5 illustrate how the saliency changes for the two musical tone signals shown in A and B of FIG. 5, respectively, with saliency on the vertical axis and time on the horizontal axis. The time after the vertical line (after about 5.8 s) is the prediction target interval, and the time before the vertical line (before about 5.8 s) is the reference interval (and the accumulation interval). In A and B of FIG. 5, the solid waveform is the actual musical tone signal and the dotted line is the predicted musical tone signal (the mean of the prediction distribution). On the first hearing, the musical tone signal in the prediction target interval is not predicted correctly: a pattern like that at the upper right of A of FIG. 5 (a pattern that appeared repeatedly in the accumulation interval) is predicted, and, as shown in C of FIG. 5, the saliency is calculated as a relatively large value. On the second hearing, the accumulation signal contains a signal almost identical to the musical tone signal in the prediction target interval, so a pattern like that at the upper right of B of FIG. 5 is predicted correctly, and, as shown in D of FIG. 5, the saliency is calculated as a relatively small value.

<Effect>
With the above configuration, the saliency estimation device 1 of this embodiment can evaluate the saliency, in a specific time interval, of an input signal that is a specific acoustic signal.

According to Non-Patent Document 1, the saliency of an auditory stimulus can be evaluated over time based on the spectral characteristics of the sound. However, because that model calculates on the spectral structure of the sound within a short time window, it assigns the same saliency to the same sound regardless of context. It therefore cannot adequately express changes in saliency that arise from changes in temporal pattern, such as the increase in saliency when the same sound is presented at an unexpected timing. Because the present method calculates on the unpredictability of time-series patterns, it can estimate the degree of conspicuousness of sound arising from pattern changes for acoustic signals that consist of repetition and deviation, such as music.

To predict the future values of a time-series signal from a given time point based on past information, a time-series prediction method such as an AR model could be used. However, in time-series prediction methods that target stationary processes, such as the AR model, all of the past contributes equally to building the model, so it is difficult to reproduce a specific past pattern. The present method assumes that the time evolution from a given time point behaves in the same way as the signal did after past segments that resemble the most recent signal pattern, and it models the certainty of that behavior (the variance of the probability distribution) as correlated with the similarity to that past. This makes it possible to predict statistical patterns even for time-series signals that form complex non-stationary processes, such as music.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected; an output unit to which a liquid crystal display or the like can be connected; a communication unit to which a communication device (for example, a communication cable) capable of communicating outside the hardware entity can be connected; a CPU (Central Processing Unit, which may include cache memory, registers, and the like); RAM and ROM as memory; an external storage device such as a hard disk; and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for processing by that program (the program need not be stored in the external storage device; for example, it may be stored in ROM, a read-only storage device). Data obtained through processing by the program is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data needed to process it are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiment described above and may be modified as appropriate without departing from the spirit of the invention. The processes described in the embodiment need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the device that executes them, or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the embodiment are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, or a magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto-Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program, for example, first temporarily stores in its own storage device the program recorded on the portable recording medium or the program transferred from the server computer. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another mode of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time the program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of this processing content may instead be implemented in hardware.

Claims

A saliency estimation method for estimating the degree of conspicuousness of sound in a partial time interval of an input signal, wherein,
among the time intervals of the input signal,
a prediction interval is the time interval for which the saliency, which is the degree of conspicuousness of the sound, is to be estimated,
a reference interval is a time interval of a predetermined time width immediately preceding the prediction interval,
an accumulation interval is a time interval preceding the reference interval,
an estimation target signal is a feature amount of the input signal corresponding to the prediction interval, and
a reference signal is a feature amount of the input signal corresponding to the reference interval,
the method including:
a similar partial interval detection step of detecting, from the accumulation interval, one or more similar partial intervals that are similar to the reference signal;
a prediction distribution generation step of generating one or more prediction distributions, each being a distribution based on the feature amount of the input signal corresponding to a predetermined time interval immediately following a detected similar partial interval; and
a saliency estimation step of estimating the saliency of the input signal corresponding to the prediction interval based on the generated prediction distributions and the estimation target signal.

The saliency estimation method according to claim 1, wherein
the saliency estimation step
estimates the saliency based on an appearance probability with which the estimation target signal appears in the one or more prediction distributions generated in the prediction distribution generation step.

The saliency estimation method according to claim 2, wherein
the saliency based on the appearance probability of the estimation target signal is changed according to the similarity between the similar partial interval and the reference interval.

A saliency estimation device for estimating the degree of conspicuousness of sound in a partial time interval of an input signal, wherein,
among the time intervals of the input signal,
a prediction interval is the time interval for which the saliency, which is the degree of conspicuousness of the sound, is to be estimated,
a reference interval is a time interval of a predetermined time width immediately preceding the prediction interval,
an accumulation interval is a time interval preceding the reference interval,
an estimation target signal is a feature amount of the input signal corresponding to the prediction interval, and
a reference signal is a feature amount of the input signal corresponding to the reference interval,
the device including:
a similar partial interval detection unit that detects, from the accumulation interval, one or more similar partial intervals that are similar to the reference signal;
a prediction distribution generation unit that generates one or more prediction distributions, each being a distribution based on the feature amount of the input signal corresponding to a predetermined time interval immediately following a detected similar partial interval; and
a saliency estimation unit that estimates the saliency of the input signal corresponding to the prediction interval based on the generated prediction distributions and the estimation target signal.

A program for causing a computer to execute each step of the saliency estimation method according to any one of claims 1 to 3.
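
For illustration only, the three steps recited in the method claims (similar partial interval detection, prediction distribution generation, and saliency estimation) might be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the invention: the function names, the one-dimensional per-frame feature, the similarity measure exp(-mean squared difference), the fixed-width Gaussian used as the prediction distribution, the parameters top_k, sigma, and floor, and the use of the negative log of the appearance probability as the saliency are all hypothetical choices that the claims do not specify.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution (assumed form of a prediction distribution)."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def estimate_saliency(features, pred_start, ref_len, top_k=3, sigma=1.0, floor=1e-3):
    """Estimate the saliency of the single frame features[pred_start].

    features   : list of scalar feature amounts, one per frame (assumption)
    pred_start : index of the one-frame prediction interval
    ref_len    : width of the reference interval, in frames
    """
    ref_start = pred_start - ref_len
    if ref_len < 1 or ref_start <= ref_len or pred_start >= len(features):
        raise ValueError("intervals do not fit inside the input signal")

    reference = features[ref_start:pred_start]  # reference signal
    target = features[pred_start]               # estimation target signal

    # Similar partial interval detection: slide a window over the accumulation
    # interval [0, ref_start) and score each window against the reference signal.
    # The frame following each window must also lie in the accumulation interval.
    candidates = []
    for s in range(ref_start - ref_len):
        window = features[s:s + ref_len]
        msd = sum((a - b) ** 2 for a, b in zip(window, reference)) / ref_len
        candidates.append((math.exp(-msd), s))  # (similarity, start index)
    candidates.sort(key=lambda c: c[0], reverse=True)
    similar = candidates[:top_k]

    # Prediction distribution generation: one Gaussian per similar interval,
    # centred on the feature amount of the frame immediately after it.
    # Saliency estimation: appearance probability of the target, with each
    # distribution weighted by the similarity of its interval to the reference.
    weighted = 0.0
    total = 0.0
    for similarity, s in similar:
        following = features[s + ref_len]
        weighted += similarity * gaussian_pdf(target, following, sigma)
        total += similarity
    probability = weighted / total if total > 0.0 else 0.0

    # Assumed mapping: a low appearance probability gives a high saliency.
    return -math.log(probability + floor)


if __name__ == "__main__":
    history = [0, 1, 2, 3] * 5 + [0, 1, 2]
    print(estimate_saliency(history + [3], pred_start=23, ref_len=3))  # expected continuation
    print(estimate_saliency(history + [9], pred_start=23, ref_len=3))  # unexpected continuation
```

In this sketch, a frame that continues the repeating pattern receives a lower saliency than a frame that breaks it, even though the rule for scoring is the same in both cases; only the preceding context differs. The similarity weighting corresponds loosely to changing the saliency according to the similarity between the similar partial interval and the reference interval.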