JP6093670B2

JP6093670B2 - Model processing apparatus, model processing method, and program

Info

Publication number: JP6093670B2
Application number: JP2013164025A
Authority: JP
Inventors: 桂右井本; 尚植松; 仲大室
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-08-07
Filing date: 2013-08-07
Publication date: 2017-03-08
Anticipated expiration: 2033-08-07
Also published as: JP2015031944A

Description

The present invention relates to a technique for creating, from an acoustic signal sequence or an accompanying acoustic feature sequence, a model that represents the relationship between situations and acoustic events and a model that represents the relationship between acoustic events and acoustic features, and to a technique for analyzing and estimating a situation using the generated models.

In the prior art disclosed in Non-Patent Document 1, the input is an acoustic signal sequence with acoustic event labels, in which each short time interval of the acoustic signal produced in each situation is given a label indicating what sound (footsteps, running water, and so on; hereinafter, an acoustic event) that interval contains. A histogram of the acoustic event labels is created from the labels of a finite number of consecutive frames. A situation model is then generated by applying a modeling technique such as a GMM (Gaussian Mixture Model), an HMM (Hidden Markov Model), or an SVM (Support Vector Machine) to the generated histograms.

Furthermore, each situation model is compared with the acoustic event histogram calculated from a newly input acoustic signal sequence with acoustic event labels (for example, using the Euclidean distance or the cosine distance), and the situation model that best satisfies the decision criterion is judged to represent the situation corresponding to that acoustic signal sequence. In this way, the prior art can estimate a situation from an acoustic signal sequence.
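The comparison step of the prior art can be illustrated with a minimal sketch. It assumes, for simplicity, that each situation model is just a reference histogram (the cited work also allows GMM, HMM, or SVM models), and the names `label_histogram`, `cosine_distance`, and `estimate_situation` are hypothetical.

```python
import math
from collections import Counter

def label_histogram(labels, num_event_types):
    """Normalized histogram of acoustic event labels over a window of frames."""
    counts = Counter(labels)
    total = len(labels)
    return [counts.get(m, 0) / total for m in range(num_event_types)]

def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)

def estimate_situation(labels, situation_histograms, num_event_types):
    """Pick the situation whose reference histogram is closest to the input."""
    h = label_histogram(labels, num_event_types)
    return min(situation_histograms,
               key=lambda name: cosine_distance(h, situation_histograms[name]))

# Hypothetical reference histograms over three acoustic event types.
models = {"cooking": [0.6, 0.4, 0.0], "cleaning": [0.0, 0.1, 0.9]}
print(estimate_situation([0, 0, 1, 0, 1], models, 3))
```

Because the event labels are fixed before the situation model is fitted, errors in the labels carry over to the situation estimate, which is the limitation the invention addresses.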

Non-Patent Document 1: Imoto et al., "A user behavior identification method based on the frequency of appearance of multiple living sounds and its application to communication", The 32nd VMA meeting of the Institute of Image Electronics Engineers of Japan

In the prior art, the situation model used to analyze and estimate situations and the acoustic event model used to create acoustic event labels were created separately. As a result, the situation model and the acoustic event model could not be optimized jointly, and errors arose when modeling situations from an acoustic signal sequence or an acoustic feature sequence.

An object of the present invention is to provide a technique that allows the relationship between situations and acoustic events and the relationship between acoustic events and acoustic features to be optimized jointly when each is modeled.

In the present invention, at least an acoustic feature sequence, the total number of acoustic event types, and the total number of situation types are used to perform a learning process that searches for the maximum of the joint distribution over the combination of acoustic events corresponding to each situation, the combination of situations corresponding to the acoustic signal sequence, and the acoustic features corresponding to each acoustic event. This yields at least the probability P(acoustic event | situation) that a situation generates an acoustic event and the probability P(acoustic feature | acoustic event) that an acoustic event generates an acoustic feature.

According to the present invention, the relationship between situations and acoustic events and the relationship between acoustic events and acoustic features can be optimized jointly when each is modeled.

A diagram illustrating the apparatus configuration of Example 1-1.
A diagram illustrating the apparatus configuration of Example 1-2.
A diagram illustrating the apparatus configuration of Example 2-1.
A diagram illustrating the apparatus configuration of Example 2-2.
A diagram illustrating the apparatus configuration of Example 3-1.
A diagram illustrating the apparatus configuration of Example 3-2.

Embodiments of the present invention are described below with reference to the drawings.
<Definition of terms>
The terms used in the examples are defined as follows.
An "acoustic event" is a sound event. Specific examples of acoustic events include "knife sound", "running water sound", "water sound", "ignition sound", "fire sound", "footsteps", and "vacuum cleaner exhaust sound". An "acoustic event label" is a label that represents an acoustic event. An "acoustic event label sequence" is a sequence of one or more acoustic event labels.

A "situation" is a latent acoustic state defined by a combination of acoustic event labels. In other words, a "situation" is the latent state of a scene as defined by acoustic events. A "situation label" is a label that represents a situation. A "situation label sequence" is a sequence of one or more situation labels.

The "probability that X generates Y" is the probability that event Y occurs given that event X occurs. It can also be expressed as "the conditional probability of Y under X" or "the conditional probability of Y given X".

[Example 1-1]
In Example 1-1, a learning process that takes an acoustic feature sequence as training information calculates a situation-acoustic event generation model, which is the probability P(acoustic event | situation) that a situation generates an acoustic event, and an acoustic event-acoustic feature generation model, which is the probability P(acoustic feature | acoustic event) that an acoustic event generates an acoustic feature. The learning process may additionally generate an acoustic signal-situation generation model, which is the probability P(situation | acoustic signal) that an acoustic signal generates a situation. For example, P(acoustic event | situation) is generated for each of a plurality of pairs of an acoustic event and a situation, P(acoustic feature | acoustic event) is generated for each of a plurality of pairs of an acoustic feature and an acoustic event, and P(situation | acoustic signal) is generated for each of a plurality of pairs of a situation and an acoustic signal. Alternatively, for example, P(acoustic event | situation) is a function that returns that probability for a given pair of an acoustic event and a situation, P(acoustic feature | acoustic event) is a function that returns that probability for a given pair of an acoustic feature and an acoustic event, and P(situation | acoustic signal) is a function that returns that probability for a given pair of a situation and an acoustic signal. Furthermore, a situation label sequence or an acoustic event label sequence may be generated from information obtained in the course of the learning process.

As illustrated in FIG. 1, the model processing apparatus 110 of this example includes an acoustic feature sequence synthesis unit 101, a situation/acoustic event modeling unit 102 (modeling unit), and a storage unit 103. The situation/acoustic event modeling unit 102 includes, for example, an initialization unit 102a, first to fourth update units 102b to 102e, a determination unit 102f, a model calculation unit 102g, and an analysis unit 102h. The model processing apparatus 110 is configured, for example, by loading a predetermined program into a known or dedicated computer.

First, acoustic feature sequences 11-1, ..., 11-S (where S is an integer of 1 or more) are input to the acoustic feature sequence synthesis unit 101. Each acoustic feature sequence 11-s (where s = 1, ..., S) is a single acoustic feature or a sequence of two or more acoustic features joined in the time-series direction (for example, in chronological order). Each acoustic feature is obtained from the acoustic signal of one short time interval (on the order of tens of milliseconds to several seconds). Each acoustic feature may be a vector of multiple elements or a scalar consisting of a single element. Examples of elements of an acoustic feature are the sound pressure level of the acoustic signal, the acoustic power, MFCC (Mel-Frequency Cepstrum Coefficient) features, and LPC (Linear Predictive Coding) features. The rise characteristics, harmonicity, temporal periodicity, and the like of the acoustic signal (see, for example, Non-Patent Document 1) may also be elements of an acoustic feature. Each acoustic feature sequence 11-s is assigned an acoustic feature sequence number s.

When a plurality of acoustic feature sequences 11-1, ..., 11-S are input, the acoustic feature sequence synthesis unit 101 joins them in the time-series direction (for example, in chronological order) and thereby obtains and outputs a single acoustic feature sequence 11 (synthesis process). When only one acoustic feature sequence 11-1 is input, the unit outputs it as the acoustic feature sequence 11. The acoustic feature sequence 11 output from the acoustic feature sequence synthesis unit 101 is input to the situation/acoustic event modeling unit 102. A single acoustic feature sequence 11 may also be input directly to the situation/acoustic event modeling unit 102 without passing through the acoustic feature sequence synthesis unit 101.
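The synthesis process can be sketched as a plain concatenation. The function name `concatenate_sequences` is hypothetical, and returning the sequence number s of every frame alongside the merged sequence is an assumption made here so that later steps can still tell which sequence a frame came from.

```python
def concatenate_sequences(sequences):
    """Join S acoustic feature sequences in the given order.

    sequences[s-1] is a list of feature vectors (one per short time interval).
    Returns the merged sequence and, for each frame, its sequence number s.
    """
    merged, seq_ids = [], []
    for s, seq in enumerate(sequences, start=1):
        merged.extend(seq)
        seq_ids.extend([s] * len(seq))
    return merged, seq_ids

# Two toy sequences of one-dimensional features.
merged, seq_ids = concatenate_sequences([[[0.1], [0.2]], [[0.3]]])
print(merged, seq_ids)
```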

Following the procedure below, the situation/acoustic event modeling unit 102 calculates (outputs), from the input acoustic feature sequence 11, the acoustic signal-situation generation model 12, which is the probability P(situation | acoustic signal) that an acoustic signal generates a situation; the situation-acoustic event generation model 13, which is the probability P(acoustic event | situation) that a situation generates an acoustic event; and the acoustic event-acoustic feature generation model 14, which is the probability P(acoustic feature | acoustic event) that an acoustic event generates an acoustic feature. The situation/acoustic event modeling unit 102 may additionally generate the situation label sequence 15 or the acoustic event label sequence 16. However, it is not essential that the unit generate the acoustic signal-situation generation model 12, the situation label sequence 15, or the acoustic event label sequence 16. The models and sequences generated by the situation/acoustic event modeling unit 102 are stored in the storage unit 103.

<Theoretical explanation of the process by which acoustic features are generated from an acoustic signal>
The acoustic signal is assumed to determine the generation probability of situations, the situation to determine the generation probability of acoustic events, and the acoustic event to determine the generation probability of acoustic features. These relationships are described as a generative model.

Suppose that the following are given: the probability P(Θ) that the acoustic signal corresponding to each acoustic feature sequence 11-s (s = 1, ..., S) making up the acoustic feature sequence 11 input to the situation/acoustic event modeling unit 102 generates situation t (t = 1, ..., T) (expressible, for example, as an S×T matrix); the probability P(Φ) that each situation t generates acoustic event m (m = 1, ..., M) (expressible, for example, as a T×M matrix); and the probability P(μ, Λ) that each acoustic event m generates an acoustic feature (expressible, for example, as an M×D mean matrix and an M×D×D variance matrix). The generation probability P(f | Θ, Φ, μ, Λ) of the acoustic feature sequence 11 is then as follows.

Here, S is an integer of 1 or more and is the number of acoustic feature sequences 11-s making up the acoustic feature sequence 11. T is an integer of 1 or more and is the number of latent situation types (the total number of situation types). M is an integer of 1 or more and is the number of acoustic event types (the total number of acoustic event types). D is an integer constant of 1 or more and is the number of dimensions of an acoustic feature. f is the sequence whose elements are the acoustic features making up the acoustic feature sequence 11. Θ denotes the set of pairs of an acoustic feature sequence 11-s and a situation t, and P(Θ) can be expressed, for example, as an S×T matrix whose element in row s and column t is the probability that the acoustic feature sequence 11-s generates situation t. Φ denotes the set of pairs of a situation t and an acoustic event m, and P(Φ) can be expressed, for example, as a T×M matrix whose element in row t and column m is the probability that situation t generates acoustic event m. μ denotes the sequence μ_1, ..., μ_M of the mean values μ_m of the acoustic features of the acoustic signal generated by acoustic event m. When each acoustic feature generated by acoustic event m is a vector (vc_m1, ..., vc_mD) of multiple elements vc_md (d = 1, ..., D), that is, when D ≥ 2, μ_m is the vector (mean(vc_m1), ..., mean(vc_mD)) whose elements are the expected values mean(vc_md) of vc_md over the elements vc_m1d to vc_mEd (where vc_md ∈ {vc_m1d, ..., vc_mEd} and E is the number of acoustic features assigned to acoustic event m). Λ denotes the sequence Λ_1, ..., Λ_M of the reciprocals Λ_m (precisions) of the variances of the acoustic features of the acoustic signal generated by acoustic event m. When each acoustic feature generated by acoustic event m is a vector (vc_m1, ..., vc_mD) of multiple elements vc_md (when D ≥ 2), Λ_m is the vector (1/ver(vc_m1), ..., 1/ver(vc_mD)) whose elements are the reciprocals 1/ver(vc_md) of the variances ver(vc_md) of the elements vc_m1d to vc_mEd (where E is the number of acoustic features assigned to acoustic event m). f_s denotes the acoustic feature sequence 11-s, that is, the sequence of the N_s acoustic features it contains. N_s is the number of per-interval acoustic features contained in the acoustic feature sequence 11-s. In other words, N_s is the number of short time intervals contained in the time interval corresponding to the acoustic feature sequence 11-s.
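The expression for P(f | Θ, Φ, μ, Λ) is not reproduced in this text. The sketch below evaluates one plausible form of it, assuming that frames are independent given the parameters, that each frame's density is a mixture over situations and acoustic events, and that Λ_m holds one precision per dimension as in the vector form described above. The names `log_gauss_diag` and `log_likelihood` are hypothetical.

```python
import math

def log_gauss_diag(x, mean, precision):
    """Log density of a Gaussian with per-dimension precision (inverse variance)."""
    return sum(0.5 * math.log(lam / (2.0 * math.pi)) - 0.5 * lam * (xi - mu) ** 2
               for xi, mu, lam in zip(x, mean, precision))

def log_likelihood(sequences, theta, phi, mu, lam):
    """Assumed form: sum over frames of log sum_t sum_m theta[s][t] * phi[t][m] * N(x | mu[m], lam[m]).

    sequences[s][i] is the i-th feature vector of sequence s,
    theta is S x T, phi is T x M, mu and lam are M x D.
    No underflow protection is applied; this is a sketch for small examples.
    """
    total = 0.0
    for s, seq in enumerate(sequences):
        for x in seq:
            p = sum(theta[s][t] * phi[t][m] * math.exp(log_gauss_diag(x, mu[m], lam[m]))
                    for t in range(len(phi)) for m in range(len(mu)))
            total += math.log(p)
    return total

# One sequence, two situations, two acoustic events, one-dimensional features.
ll = log_likelihood([[[0.1], [4.8]]],
                    theta=[[0.5, 0.5]],
                    phi=[[0.9, 0.1], [0.2, 0.8]],
                    mu=[[0.0], [5.0]],
                    lam=[[1.0], [1.0]])
print(ll)
```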

The generation probability P(f_s) of the acoustic feature sequence 11-s can be expressed as follows using, for example, the hyperparameter α of the prior distribution (assumed to be a Dirichlet distribution) of the probability θ that each acoustic signal generates a situation, the hyperparameter γ of the prior distribution (assumed to be a Dirichlet distribution) of the probability φ that each situation generates an acoustic event, the hyperparameters β_0 and μ_0 of the mean of the acoustic features in each acoustic event, and the hyperparameters ν_0 and B_0 of the precision of the acoustic features in each acoustic event.

Here, f_{s,i}, z_{s,i}, and m_{s,i} denote the acoustic feature, the situation, and the acoustic event, respectively, in the i-th short time interval from the beginning of the acoustic feature sequence 11-s. Dir(·), N(·), and W(·) denote the probability density functions of the Dirichlet distribution, the normal distribution, and the Wishart distribution, respectively.
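The chain from acoustic signal to situation to acoustic event to acoustic feature can be made concrete by sampling from it. The following is a minimal sketch under simplifying assumptions: θ is drawn from a Dirichlet prior, while φ, μ, and the per-dimension precisions are passed in as fixed values rather than drawn from their own priors. All function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dir(alpha) via normalized gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_categorical(probs, rng):
    """Draw an index k with probability probs[k]."""
    r, acc = rng.random(), 0.0
    for k, p in enumerate(probs):
        acc += p
        if r < acc:
            return k
    return len(probs) - 1

def generate_sequence(n_frames, alpha, phi, mu, lam, rng):
    """Sample (situation, acoustic event, feature) for each frame of one sequence."""
    theta = sample_dirichlet(alpha, rng)           # situation probabilities of this signal
    frames = []
    for _ in range(n_frames):
        z = sample_categorical(theta, rng)         # situation z_{s,i}
        m = sample_categorical(phi[z], rng)        # acoustic event m_{s,i}
        f = [rng.gauss(mu_d, (1.0 / lam_d) ** 0.5)  # feature f_{s,i}, std = precision^-1/2
             for mu_d, lam_d in zip(mu[m], lam[m])]
        frames.append((z, m, f))
    return frames

rng = random.Random(0)
for frame in generate_sequence(3, [1.0, 1.0], [[0.9, 0.1], [0.2, 0.8]],
                               [[0.0], [5.0]], [[1.0], [1.0]], rng):
    print(frame)
```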

Here, the probability density function Dir(ι | τ) of the Dirichlet distribution of order K-1 (K is an integer of 2 or more) and the probability density function N(μ | β_0, μ_0, Λ) W(Λ | ν_0, B_0) of the Gauss-Wishart distribution of order D are as follows.

Here, τ is a parameter consisting of τ_k (k = 1, ..., K), ι is a random variable, and Γ is the gamma function. (·)^T denotes the transpose of (·). In addition, the following holds.
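The density expressions themselves are not reproduced in this text. For reference, the Dirichlet density below is the standard textbook form. The Gauss-Wishart factors are written in one common convention, in which N(μ | β_0, μ_0, Λ) is read as a Gaussian with mean μ_0 and precision β_0Λ, B_0 acts as an inverse scale matrix, and C_W is a normalization constant. That convention and the symbol C_W are assumptions here, not the patent's confirmed definitions; under the other common convention B_0 would be replaced by its inverse.

```latex
\mathrm{Dir}(\iota \mid \tau)
  = \frac{\Gamma\!\left(\sum_{k=1}^{K}\tau_k\right)}{\prod_{k=1}^{K}\Gamma(\tau_k)}
    \prod_{k=1}^{K}\iota_k^{\tau_k-1},
  \qquad \iota_k \ge 0,\quad \sum_{k=1}^{K}\iota_k = 1

\mathcal{N}\!\left(\mu \mid \mu_0, (\beta_0\Lambda)^{-1}\right)
  = \frac{|\beta_0\Lambda|^{1/2}}{(2\pi)^{D/2}}
    \exp\!\left(-\frac{\beta_0}{2}(\mu-\mu_0)^{\mathsf T}\Lambda(\mu-\mu_0)\right)

\mathcal{W}(\Lambda \mid \nu_0, B_0)
  = C_W\,|\Lambda|^{(\nu_0-D-1)/2}
    \exp\!\left(-\frac{1}{2}\operatorname{Tr}(B_0\Lambda)\right)
```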

<Description of the generation model calculation process>
The situation/acoustic event modeling unit 102 generates the generation models, label sequences, and so on described above from the input acoustic feature sequence 11 through a learning process. Based on the input acoustic feature sequence 11, this learning process searches for the maximum of the joint distribution, built on the probability P(situation | acoustic signal) that an acoustic signal generates a situation, the probability P(acoustic event | situation) that a situation generates an acoustic event, and the probability P(acoustic feature | acoustic event) that an acoustic event generates an acoustic feature, over the combination of acoustic events corresponding to each situation, the combination of situations corresponding to the acoustic signal sequence, and the acoustic features corresponding to each acoustic event. In other words, the situation/acoustic event modeling unit 102 performs a learning process (maximum likelihood learning) that maximizes the likelihood (or log-likelihood) of the input acoustic feature sequence 11 with respect to P(situation | acoustic signal), P(acoustic event | situation), and P(acoustic feature | acoustic event). Put differently, the unit performs a learning process that maximizes the likelihood of the input acoustic feature sequence 11 (that is, the likelihood function L(acoustic feature sequence | parameters) = P(acoustic feature sequence | parameters) or the log-likelihood function log L(acoustic feature sequence | parameters)) with respect to the model parameters of the acoustic signal-situation generation model 12, the situation-acoustic event generation model 13, and the acoustic event-acoustic feature generation model 14, and uses the result to generate each generation model and each label sequence. Here, "log" denotes the natural logarithm.

Techniques such as the Markov chain Monte Carlo method (MCMC method) and the variational Bayes method (VB method), based on the generative process above, can be used for this learning. The following describes how the parameters of the generation models are calculated by the variational Bayes method.

<Preparation for calculating the generation models>
In calculating the generation model parameters by the variational Bayes method, the unknown model parameters α, γ, μ_0, β_0, ν_0, and B_0 are treated as random variables, and the model parameters α, γ, μ_0, β_0, ν_0, and B_0 that maximize the log-likelihood function for f, the acoustic feature sequence 11, are sought. Consider the log marginal likelihood L(f) = log p(f | α, γ, μ_0, β_0, ν_0, B_0) obtained by marginalizing all the unknown model parameters α, γ, μ_0, β_0, ν_0, and B_0 of this log-likelihood function. Introducing a new distribution q(m, z, μ, Λ, φ, θ) (hereinafter called the "variational posterior distribution") then gives, by Jensen's inequality, the following lower bound F[q] on the log marginal likelihood.

Here, <P(·)>_q(·) denotes the expected value of P(·) with respect to q(·). z is the sequence of situations corresponding to the acoustic feature sequence 11, φ is a variable representing the probability that a situation generates an acoustic event, and θ is a variable representing the probability that an acoustic signal represents a situation. The lower bound F[q] is a functional whose argument is the variational posterior distribution q(m, z, μ, Λ, φ, θ).
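The lower bound itself is not reproduced in this text. As a reference sketch, the standard form obtained from Jensen's inequality is shown below, with L(f) taken to be the log marginal likelihood and the conditioning on the hyperparameters α, γ, μ_0, β_0, ν_0, and B_0 suppressed for readability. The patent's own expression may group the terms differently.

```latex
L(f) = \log p(f)
  = \log \sum_{m,z} \int p(f, m, z, \mu, \Lambda, \phi, \theta)\,
      d\mu\, d\Lambda\, d\phi\, d\theta
  \;\ge\; \sum_{m,z} \int q(m, z, \mu, \Lambda, \phi, \theta)\,
      \log \frac{p(f, m, z, \mu, \Lambda, \phi, \theta)}
                {q(m, z, \mu, \Lambda, \phi, \theta)}\,
      d\mu\, d\Lambda\, d\phi\, d\theta
  \;\equiv\; F[q]
```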

The following also holds from the above expression.

Therefore, the following relationship holds.
L(f) = F[q] + KL(q(m, z, μ, Λ, φ, θ), p(m, z, μ, Λ, φ, θ | f))
Here, KL(·) denotes the KL divergence.
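The decomposition of the log marginal likelihood into the lower bound plus a KL divergence can be checked numerically on a toy model. The sketch below replaces the full set of latent variables by a single discrete hidden variable h with three states, which is an illustrative simplification. The function name `elbo_and_kl` and the numbers are hypothetical.

```python
import math

def elbo_and_kl(joint, q):
    """joint[h] = p(f, h) for a fixed observation f; q[h] is a distribution over h.

    Returns (log p(f), F[q], KL(q, p(h | f))).
    """
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    elbo = sum(qh * math.log(j / qh) for qh, j in zip(q, joint) if qh > 0)
    kl = sum(qh * math.log(qh / ph) for qh, ph in zip(q, posterior) if qh > 0)
    return math.log(evidence), elbo, kl

log_evidence, lower_bound, kl = elbo_and_kl([0.10, 0.30, 0.05], [0.2, 0.5, 0.3])
# The two printed values agree up to floating-point rounding.
print(log_evidence, lower_bound + kl)
```

Because log p(f) does not depend on q, raising the lower bound can only come from lowering the KL term, which is the argument the text makes next.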

Noting that L(f) depends only on f, maximizing the lower bound F[q] is equivalent to minimizing the KL divergence between q(m, z, μ, Λ, φ, θ) and p(m, z, μ, Λ, φ, θ | f). In other words, the variational posterior distribution q(m, z, μ, Λ, φ, θ) that maximizes the lower bound F[q] is the best approximation of the true posterior distribution p(m, z, μ, Λ, φ, θ | f). Now assume that the variational posterior distribution factorizes as q(m, z, μ, Λ, φ, θ) = q(m, z) q(μ, Λ, φ, θ). Here m and z correspond to the hidden (unobserved) variables of variational Bayes learning, and μ, Λ, φ, and θ correspond to the parameters. The lower bound F[q] can then be rewritten as follows.

First, let q(m, z) = q(m | z) q(z) and derive the variational posterior distributions of the hidden variables m and z. If the terms of F[q] that do not depend on z are treated as constants and the variational posterior distribution q(z) of z is derived using, for example, the method of Lagrange multipliers, q(z) turns out to be expressible as a product of multinomial distributions. The parameter r_nt of q(z) is therefore introduced, and q(z) can be expressed as follows.

Here, N is the number of short time intervals contained in the time interval corresponding to the acoustic feature sequence 11 (N = Σ_{s=1}^{S} N_s), and n = 1, ..., N. z_nt is 1 when the n-th acoustic feature from the beginning of the acoustic feature sequence 11 corresponds to situation t, and 0 otherwise.

Similarly, deriving the variational posterior distribution q(m | z) of m shows that q(m | z) can be expressed as a product of multinomial distributions. The parameter u_nm of q(m | z) is therefore introduced, and q(m | z) can be expressed as follows.

Here, y_nm is 1 when the n-th acoustic feature from the beginning of the acoustic feature sequence 11 corresponds to acoustic event m, and 0 otherwise.
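The expressions for q(z) and q(m | z) are not reproduced in this text. Given that each is described as a product of multinomial distributions with parameters r_nt and u_nm and indicator variables z_nt and y_nm, a natural reading is the following sketch. The normalization constraints are the usual ones and are stated here as an assumption, as is the simplification that any dependence of q(m | z) on z is absorbed into u_nm.

```latex
q(z) = \prod_{n=1}^{N} \prod_{t=1}^{T} r_{nt}^{\,z_{nt}},
  \qquad \sum_{t=1}^{T} r_{nt} = 1

q(m \mid z) = \prod_{n=1}^{N} \prod_{m=1}^{M} u_{nm}^{\,y_{nm}},
  \qquad \sum_{m=1}^{M} u_{nm} = 1
```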

Next, assume q(μ, Λ, φ, θ) = q(φ) q(θ) q(μ | Λ) q(Λ) and derive the variational posterior distributions of the parameters μ, Λ, φ, and θ. First, among the parameters r_nt, let r_sn't denote the parameter corresponding to the n'-th (n' = 1, ..., N_s) short time interval from the beginning of the time interval corresponding to the acoustic feature sequence 11-s. That is, the following relationship is satisfied.

N_st is defined as follows.

The variational posterior distribution q(θ) of the parameter θ is then a Dirichlet distribution of the following form.

Here, θ_st is the probability that acoustic signal s generates situation t, and C_θ is the normalization constant that makes the integral of q(θ) over the entire space of θ equal to 1.
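The definitions of N_st and of the Dirichlet form of q(θ) are not reproduced in this text. The sketch below assumes the usual variational Bayes reading: N_st is the sum of r_sn't over the frames n' of sequence s, and the Dirichlet parameter for (s, t) is the prior value α plus N_st. The function names and the use of a single symmetric scalar α are hypothetical.

```python
def expected_situation_counts(r, seq_lengths):
    """r[n][t]: responsibility of situation t for frame n of the concatenated sequence.

    Returns N[s][t], the expected number of frames of sequence s assigned to
    situation t (assumed definition: sum of r over the frames of sequence s).
    """
    counts, start = [], 0
    num_situations = len(r[0])
    for n_s in seq_lengths:
        block = r[start:start + n_s]
        counts.append([sum(row[t] for row in block) for t in range(num_situations)])
        start += n_s
    return counts

def dirichlet_posterior(alpha, counts):
    """Assumed posterior Dirichlet parameters: prior alpha plus expected counts."""
    return [[alpha + c for c in row] for row in counts]

r = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]   # three frames, two situations
counts = expected_situation_counts(r, seq_lengths=[2, 1])
print(counts, dirichlet_posterior(0.1, counts))
```

The same pattern would apply to N_tm and q(φ), with the responsibilities of situations and acoustic events combined per frame.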

また、Ｎ_ｔｍを以下のようにおく。

すると、パラメータφの変分事後分布ｑ（φ）は、以下の形のディリクレ分布となる。

ただし、Ｃ_φはｑ（φ）のφについての全空間積分値を１とするための規格化定数である。 Further, N _tm is set as follows.

Then, the variational posterior distribution q (φ) of the parameter φ is a Dirichlet distribution having the following form.

Here, C_φ is a normalization constant that makes the integral of q(φ) over the entire space of φ equal to 1.

同様に、μ_ｍの変分事後分布ｑ（μ_ｍ｜Λ_ｍ）は以下のように算出可能である。

つまり、ｑ（μ_ｍ｜Λ_ｍ）は平均がμ_ｍ、共分散がβ_ｍΛ_ｍのガウス分布であることが分かる。 Similarly, the variational posterior distribution q(μ_m | Λ_m) of μ_m can be calculated as follows.

That is, q(μ_m | Λ_m) is a Gaussian distribution with mean μ_m and covariance β_m Λ_m.

さらに、Λ_ｍの変分事後分布ｑ（Λ_ｍ）は以下の様に記述可能である。

ただし、以下を満たす。

つまり、ｑ（Λ_ｍ）はν_０およびＢ_ｍをパラメータとするＷｉｓｈａｒｔ分布であることが分かる。 Furthermore, the variational posterior distribution q(Λ_m) of Λ_m can be written as follows.

However, the following is satisfied.

That is, it can be seen that q (Λ _m ) is a Wishart distribution with ν ₀ and B _m as parameters.

以上によってパラメータμ，Λ，φ，θの変分事後分布ｑ（μ，Λ，φ，θ）が導出できたので、再び、隠れ変数ｍ，ｚの変分事後分布の導出に戻り、パラメータｒ_ｎｔおよびｕ_ｎｍを導出する。 The variational posterior distribution q(μ, Λ, φ, θ) of the parameters μ, Λ, φ, θ has thus been derived. The derivation therefore returns to the variational posterior distributions of the hidden variables m and z, and the parameters r_nt and u_nm are derived.

まず、変分事後分布ｑ（ｚ）のｚについての全空間積分値が１であるとの制約条件のもとでＦ［ｑ］を最大化するｑ（ｚ）は、以下のようになる。

ただし、Ｃ_ｚはｑ（ｚ）のｚについての全空間積分値を１とするための規格化定数である。また、φ_ｔｍは状況ｔが音響イベントｍを生成する確率を表す。 First, q (z) that maximizes F [q] under the constraint that the total spatial integration value for z of the variational posterior distribution q (z) is 1 is as follows.

Here, C_z is a normalization constant that makes the sum of q(z) over all values of z equal to 1. φ_tm represents the probability that the situation t generates the acoustic event m.

ここで

として、この部分を計算すると以下のようになる。

ただし、Ψはディガンマ関数を表す。 Here, consider the following term.

Calculating this term gives the following.

Here, Ψ represents a digamma function.

よって最終的に、式（１）（８）より、音響特徴量列１１−ｓに対応するパラメータｒ_ｓｎ’ｔは以下のように表現できる。

Therefore, finally, from the equations (1) and (8), the parameter r _sn't corresponding to the acoustic feature quantity sequence 11-s can be expressed as follows.

ただし、パラメータｕ_ｎｍのうち、音響特徴量列１１−ｓに対応する時間区間の先頭からｎ’番目（ｎ’＝１，・・・，Ｎ_ｓ）の短時間区間に対応するパラメータをｕ_ｓｎ’ｍとおく。すなわち、以下の関係を満たす。

また、Ｕ_ｓｎ’ｍはｕ_ｓｎ’ｍを用いて以下のように表現される。

Here, among the parameters u_nm, the parameter corresponding to the n′-th (n′ = 1, ..., N_s) short time interval from the beginning of the time interval corresponding to the acoustic feature quantity sequence 11-s is denoted u_sn′m. That is, the following relationship is satisfied.

U _sn′m is expressed as follows using u _sn′m .

また、変分事後分布ｑ（ｍ｜ｚ）のｍについての全空間積分値が１であるとの制約条件のもとでＦ［ｑ］を最大化するｑ（ｍ｜ｚ）は、以下のようになる。

ただし、Ｃ_ｍ，ｚはｑ（ｍ，ｚ）の（ｍ，ｚ）についての全空間積分値を１とするための規格化定数である。
この各項をｚの変分事後分布ｑ（ｚ）の場合と同様に算出していくと、以下のようになる。

よって、以下を満たす。

Further, the q(m | z) that maximizes F[q] under the constraint that the variational posterior distribution q(m | z) sums to 1 over all values of m is as follows.

Here, C_{m,z} is a normalization constant that makes the sum of q(m, z) over all values of (m, z) equal to 1.
Calculating each of these terms in the same manner as for the variational posterior distribution q(z) of z gives the following.

Therefore, the following is satisfied.

よって最終的に、式（２）（１２）より、パラメータｕ_ｎｍは以下のように表現できる。

Therefore, finally, from the equations (2) and (12), the parameter u _nm can be expressed as follows.

以上より、生成モデルを推定する際は、隠れ変数であるｍ，ｚの変分事後分布とパラメータであるμ，Λ，φ，θの変分事後分布とを上記の式（３）〜（７）（９）〜（１１）（１３）に当てはめて繰り返し更新すれば良いことが分かる。 From the above, when estimating the generation model, it suffices to repeatedly update the variational posterior distributions of the hidden variables m and z and the variational posterior distributions of the parameters μ, Λ, φ, θ by applying equations (3) to (7), (9) to (11), and (13) above.

＜生成モデル算出の流れの例＞
（ｉ）まず、状況／音響イベントモデル化部１０２は、Ｓ，Ｔ，Ｍ，Ｄ，Ｎ，Ｎ_ｓを入力とし、ハイパパラメータとしてα，γ，μ_０，β_０，ν_０，Ｂ_０を設定し（例えば、α＝０．３，γ＝０．１，μ_０＝０（全ての要素を0とするベクトル），β_０＝２．０，ν_０＝Ｄ＋１，Ｂ_０＝I（単位行列）等）、これらを用いて、以下のように各変分事後分布のハイパパラメータを初期化する。 <Example of generation model calculation flow>
(i) First, the situation / acoustic event modeling unit 102 receives S, T, M, D, N, and N_s as inputs and sets α, γ, μ₀, β₀, ν₀, and B₀ as hyperparameters (for example, α = 0.3, γ = 0.1, μ₀ = 0 (a vector whose elements are all 0), β₀ = 2.0, ν₀ = D + 1, B₀ = I (identity matrix)). Using these, it initializes the hyperparameters of each variational posterior distribution as follows.

（ｉ−１）状況／音響イベントモデル化部１０２の初期化部１０２ａは、ｓ＝１，・・・，Ｓ、ｔ＝１，・・・・，Ｔに対して、以下を設定する。
α_ｓｔ ^（０）＝α
Ｎ_ｓｔ ^（０）＝Ｎ_ｓ／Ｔ
なお、上付き添え字の（０）はｓｔの真上に記載すべきであるが、記述の制約上ｓｔの右上に表記されている。すなわち、文字「Ｇ」「ｇ１」「ｇ２」についての「Ｇ_ｇ１ ^ｇ２」との表記は、「ｇ２」が「ｇ１」の真上にある表記と同義である。 (i-1) The initialization unit 102a of the situation / acoustic event modeling unit 102 sets the following for s = 1, ..., S and t = 1, ..., T.
α _st ⁽⁰⁾ = α
N _st ⁽⁰⁾ = N _s / T
The superscript (0) should be described immediately above st, but it is described at the upper right of st due to the restriction of description. That is, the notation “G _g1 ^g2 ” for the letters “G”, “g1”, and “g2” is synonymous with the notation that “g2” is directly above “g1”.

（ｉ−２）状況／音響イベントモデル化部１０２の初期化部１０２ａは、ｔ＝１，２，・・・，Ｔ、ｍ＝１，２，・・・・，Ｍに対して、以下を設定し、さらにｈ＝０とする。
γ_ｔｍ ^（０）＝γ
Ｎ_ｔｍ ^（０）＝Ｎ／（Ｔ×Ｍ）
Ｎ_ｍ ^（０）＝Ｎ／Ｍ
μ_ｍ ^（０）＝μ_０
ν_ｍ ^（０）＝ν_０
Ｂ_ｍ ^（０）＝Ｂ_０
Ｕ_ｓｎ’ｍ ^（０）＝０（零行列）

(i-2) The initialization unit 102a of the situation / acoustic event modeling unit 102 sets the following for t = 1, 2, ..., T and m = 1, 2, ..., M, and further sets h = 0.
γ _tm ⁽⁰⁾ = γ
N _tm ⁽⁰⁾ = N / (T × M)
N _m ⁽⁰⁾ = N / M
μ _m ⁽⁰⁾ = μ ₀
ν _m ⁽⁰⁾ = ν ₀
B _m ⁽⁰⁾ = B ₀
U _sn'm ⁽⁰⁾ = 0 (zero matrix)
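The initialization in steps (i-1) and (i-2) can be sketched as follows. This is only a minimal illustration in plain Python, not the implementation described in this specification: the function name `init_hyperparameters` and the dictionary layout are hypothetical, and scalar stand-ins are assumed for μ₀ and B₀ (that is, D = 1, so ν₀ = D + 1 = 2).

```python
def init_hyperparameters(S, T, M, N_s, alpha=0.3, gamma=0.1,
                         mu0=0.0, beta0=2.0, nu0=2.0, B0=1.0):
    """Return the h = 0 values listed in steps (i-1) and (i-2).

    N_s is a list holding the number of short time intervals of each of
    the S sequences; N is their sum. Scalars replace vectors/matrices
    (assumption: D = 1).
    """
    N = sum(N_s)
    return {
        # (i-1): indexed by [s][t]
        "alpha_st": [[alpha] * T for _ in range(S)],
        "N_st": [[N_s[s] / T] * T for s in range(S)],
        # (i-2): indexed by [t][m] or [m]
        "gamma_tm": [[gamma] * M for _ in range(T)],
        "N_tm": [[N / (T * M)] * M for _ in range(T)],
        "N_m": [N / M] * M,
        "mu_m": [mu0] * M,
        "nu_m": [nu0] * M,
        "B_m": [B0] * M,
        # U_sn'm starts at zero, indexed by [s][n'][m]
        "U": [[[0.0] * M for _ in range(N_s[s])] for s in range(S)],
        "beta0": beta0,
        "h": 0,
    }
```

For instance, `init_hyperparameters(2, 3, 4, [6, 9])` spreads each sequence's N_s evenly over the T = 3 situations and the total N = 15 evenly over the T × M situation-event pairs.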

その後、状況／音響イベントモデル化部１０２は、入力された音響特徴量ｆ_１，・・・，ｆ_Ｎの列を用いて、以下の（ｉｉ−１−１），（ｉｉ−１−２），（ｉｉ−２−１），および（ｉｉ−２−２）を、終了条件が満たされるまで繰り返す。終了条件の例は、（ｉｉ−１−１），（ｉｉ−１−２），（ｉｉ−２−１），および（ｉｉ−２−２）を規定の回数（正値、例えば１〜３０００回程度）繰り返すこと、または、所望の結果が得られこと（例えば、割り当ての前後において、Ｆ（ｑ）の変化が一定の閾値（例えば０．０１％）以下にこと）である。 After that, the situation / acoustic event modeling unit 102 uses the input sequence of acoustic feature quantities f₁, ..., f_N to repeat the following (ii-1-1), (ii-1-2), (ii-2-1), and (ii-2-2) until a termination condition is satisfied. Examples of the termination condition are that (ii-1-1), (ii-1-2), (ii-2-1), and (ii-2-2) have been repeated a prescribed number of times (a positive value, for example about 1 to 3000 times), or that a desired result has been obtained (for example, the change in F(q) before and after the assignment is at or below a certain threshold (for example, 0.01%)).

（ｉｉ−１−１）状況／音響イベントモデル化部１０２の第１更新部１０２ｂは、ｓ＝１，２，・・・，Ｓ、ｎ’＝１，２，・・・，Ｎ_ｓ、ｔ＝１，２，・・・・，Ｔに対して、以下のように隠れ変数ｚの変分事後分布ｑ（ｚ）のパラメータを更新して出力する。なお、ｒ_ｓｎ’ｔ ^（ｈ）はｈ回目の更新で得られたｒ_ｓｎ’ｔであり、Ｒ_ｓｎ’ｔ ^（ｈ）はｈ回目の更新で得られたＲ_ｓｎ’ｔであり、ｕ_ｓｎ’ｍ ^（ｈ）はｈ回目の更新で得られたｕ_ｓｎ’ｍであり、Ｕ_ｓｎ’ｍ ^（ｈ）はｈ回目の更新で得られたＵ_ｓｎ’ｍである。

その後（ｉｉ−１−２）に進む。 (ii-1-1) The first update unit 102b of the situation / acoustic event modeling unit 102 updates and outputs the parameters of the variational posterior distribution q(z) of the hidden variable z as follows, for s = 1, 2, ..., S, n′ = 1, 2, ..., N_s, and t = 1, 2, ..., T. Note that r_sn′t^(h) is r_sn′t obtained by the h-th update, R_sn′t^(h) is R_sn′t obtained by the h-th update, u_sn′m^(h) is u_sn′m obtained by the h-th update, and U_sn′m^(h) is U_sn′m obtained by the h-th update.

Thereafter, the process proceeds to (ii-1-2).

（ｉｉ−１−２）状況／音響イベントモデル化部１０２の第２更新部１０２ｃは、ｎ＝１，２，・・・，Ｎ、ｍ＝１，２，・・・・，Ｍに対して、以下のように隠れ変数ｍの変分事後分布ｑ（ｍ｜ｚ）のパラメータを更新して出力する。

その後（ｉｉ−２−１）に進む。 (ii-1-2) The second update unit 102c of the situation / acoustic event modeling unit 102 updates and outputs the parameters of the variational posterior distribution q(m | z) of the hidden variable m as follows, for n = 1, 2, ..., N and m = 1, 2, ..., M.

Thereafter, the process proceeds to (ii-2-1).

（ｉｉ−２−１）状況／音響イベントモデル化部１０２の第３更新部１０２ｄは、ｓ＝１，２，・・・，Ｓ、ｎ’＝１，２，・・・，Ｎ_ｓ、ｔ＝１，２，・・・・，Ｔに対して、以下のようにパラメータθの変分事後分布ｑ（θ）のパラメータを更新して出力する。

その後（ｉｉ−２−２）に進む。 (ii-2-1) The third update unit 102d of the situation / acoustic event modeling unit 102 updates and outputs the parameters of the variational posterior distribution q(θ) of the parameter θ as follows, for s = 1, 2, ..., S, n′ = 1, 2, ..., N_s, and t = 1, 2, ..., T.

Thereafter, the process proceeds to (ii-2-2).

（ｉｉ−２−２）状況／音響イベントモデル化部１０２の第４更新部１０２ｅは、ｎ＝１，２，・・・，Ｎ、ｔ＝１，２，・・・，Ｔ、ｍ＝１，２，・・・・，Ｍに対して、以下のようにパラメータφ，μ，Λの変分事後分布ｑ（φ），ｑ（μ_ｍ｜Λ_ｍ），ｑ（Λ_ｍ）のパラメータを更新して出力する。

(ii-2-2) The fourth update unit 102e of the situation / acoustic event modeling unit 102 updates and outputs the parameters of the variational posterior distributions q(φ), q(μ_m | Λ_m), and q(Λ_m) of the parameters φ, μ, Λ as follows, for n = 1, 2, ..., N, t = 1, 2, ..., T, and m = 1, 2, ..., M.

その後、状況／音響イベントモデル化部１０２の判定部１０２ｆは終了条件を満たしたかを判定する。終了条件を満たしていない場合、判定部１０２ｆはｈ＋１を新たなｈとして（ｉｉ−１−１）の処理に戻し、第１〜４更新部１０２ｂ〜１０２ｅの処理を再び実行させた後、終了条件を満たしたかを判定する。終了条件を満たした場合には、状況／音響イベントモデル化部１０２のモデル算出部１０２ｇが、第１〜４更新部１０２ｂ〜１０２ｅの何れかで得られた更新後のパラメータを用いて、音響信号−状況生成モデル１２、状況−音響イベント生成モデル１３、および音響イベント−音響特徴量生成モデル１４を算出する。状況／音響イベントモデル化部１０２の解析部１０２ｈが、更新後のパラメータを用いて、状況ラベル列１５を生成してもよいし、音響イベントラベル列１６を生成してもよい。ただし音響信号−状況生成モデル１２や状況ラベル列１５や音響イベントラベル列１６を生成することは必須ではない。状況／音響イベントモデル化部１０２が生成した生成モデルやラベル列は記憶部１０３に格納される。 Thereafter, the determination unit 102f of the situation / acoustic event modeling unit 102 determines whether the termination condition is satisfied. If it is not satisfied, the determination unit 102f sets h + 1 as the new h, returns the processing to (ii-1-1), causes the first to fourth update units 102b to 102e to execute their processing again, and then determines again whether the termination condition is satisfied. If the termination condition is satisfied, the model calculation unit 102g of the situation / acoustic event modeling unit 102 uses the updated parameters obtained by any of the first to fourth update units 102b to 102e to calculate the acoustic signal-situation generation model 12, the situation-acoustic event generation model 13, and the acoustic event-acoustic feature quantity generation model 14. The analysis unit 102h of the situation / acoustic event modeling unit 102 may use the updated parameters to generate the situation label sequence 15 or the acoustic event label sequence 16. However, generating the acoustic signal-situation generation model 12, the situation label sequence 15, and the acoustic event label sequence 16 is not essential. The generation models and label sequences generated by the situation / acoustic event modeling unit 102 are stored in the storage unit 103.
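The control flow of the determination unit 102f can be illustrated with the following sketch. It is an assumption-laden outline rather than the actual implementation: `update_step` is a hypothetical callable standing in for one pass through (ii-1-1) to (ii-2-2) that returns the current value of F[q], and the defaults mirror the example values in the text (up to 3000 repetitions, relative change of 0.01%).

```python
def run_until_converged(update_step, max_iters=3000, rel_tol=1e-4):
    """Repeat update_step until the iteration cap or convergence of F[q].

    update_step(h) is a hypothetical callable: it performs one pass of
    the four update steps for iteration h and returns F[q] as a float.
    Returns (number of passes executed, last value of F[q]).
    """
    prev_f = None
    f = None
    h = 0
    while h < max_iters:
        f = update_step(h)
        h += 1  # h + 1 becomes the new h
        if prev_f is not None:
            # Relative change of F[q]; fall back to absolute change at 0.
            denom = abs(prev_f) if prev_f != 0 else 1.0
            if abs(f - prev_f) / denom <= rel_tol:
                break
        prev_f = f
    return h, f
```

Either termination condition from the text ends the loop: reaching `max_iters`, or a relative change in F[q] at or below `rel_tol`.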

例えば、状況／音響イベントモデル化部１０２のモデル算出部１０２ｇは、以下のｔ＝１，・・・，Ｔについての以下のＮ_ｓｔを音響信号−状況生成モデル１２として算出してもよいし、ｍ＝１，・・・，Ｍ、ｔ＝１，・・・，Ｔについての以下のＮ_ｔｍを状況−音響イベント生成モデル１３として算出してもよいし、ｍ＝１，・・・，Ｍについての以下のν_ｍ ^（ｈ）を平均、Σ_μｍ ^（ｈ）を分散、ｇ_μｍ ^（ｈ）を自由度とするＳｔｕｄｅｎｔ−ｔ分布に従う確率密度関数を音響イベント−音響特徴量生成モデル１４としてもよい。ただし、下付き添え字の「μｍ」はμ_ｍを表す。

For example, the model calculation unit 102g of the situation / acoustic event modeling unit 102 may calculate the following N_st for t = 1, ..., T as the acoustic signal-situation generation model 12, may calculate the following N_tm for m = 1, ..., M and t = 1, ..., T as the situation-acoustic event generation model 13, and may use, as the acoustic event-acoustic feature quantity generation model 14, the probability density function following a Student-t distribution with the following ν_m^(h) as the mean, Σ_μm^(h) as the variance, and g_μm^(h) as the degrees of freedom for m = 1, ..., M. Here, the subscript "μm" denotes μ_m.

また例えば、状況／音響イベントモデル化部１０２の解析部１０２ｈは、音響特徴量列１１−ｓに対応する時間区間の先頭からｎ’番目の短時間区間の音響特徴量に対してａｒｇｍａｘ_ｔＲ_ｓｎ’ｔを算出し、それらを並べた状況ラベル列１５や、音響特徴量列１１に対応する時間区間の先頭からｎ番目の短時間区間の音響特徴量に対してａｒｇｍａｘ_ｍＵ_ｎｍを算出し、それらを並べた音響イベントラベル列１６を出力しても良い。 Further, for example, the analysis unit 102h of the situation / acoustic event modeling unit 102 may calculate argmax_t R_sn′t for the acoustic feature quantity of the n′-th short time interval from the beginning of the time interval corresponding to the acoustic feature quantity sequence 11-s and output the situation label sequence 15 in which the results are arranged, and may calculate argmax_m U_nm for the acoustic feature quantity of the n-th short time interval from the beginning of the time interval corresponding to the acoustic feature quantity sequence 11 and output the acoustic event label sequence 16 in which the results are arranged.
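The argmax-based labeling performed by the analysis unit 102h can be sketched as below. The nested-list layout (`scores[n][k]` holding R or U for the n-th short time interval and the k-th class) and the function name are assumptions made only for illustration; labels are returned 1-based to match t = 1, ..., T and m = 1, ..., M.

```python
def argmax_labels(scores):
    """Return, for each short time interval, the 1-based index of the
    class (situation or acoustic event) with the largest score."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in scores]


# Example: three situations scored over two short time intervals.
R = [[0.1, 0.7, 0.2],
     [0.5, 0.3, 0.2]]
situation_labels = argmax_labels(R)
```

Applying the same function to U yields the acoustic event label sequence.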

以上のように本実施例では、状況/音響イベントモデル化部１０２において、音響信号が状況を生成する確率や、状況が音響イベントを生成する確率のみではなく、音響イベントが音響特徴量を生成する確率の学習をも同時に行うことができる。その結果、音響イベント間の類似度を精度良く生成モデルに組み込むことができる。また、上記更新の結果で割り当てられた状況や音響イベントを分析することで、各音響特徴量がどの状況や音響イベントにより生成されたものかを知ることも可能である。 As described above, in the present embodiment, in the situation / acoustic event modeling unit 102, not only the probability that an acoustic signal generates a situation or the probability that a situation generates an acoustic event, but also an acoustic event generates an acoustic feature. Probability learning can be performed simultaneously. As a result, the similarity between acoustic events can be accurately incorporated into the generation model. Further, by analyzing the situation and acoustic event assigned as a result of the update, it is also possible to know which situation and acoustic event each acoustic feature amount is generated by.

［実施例１−２］
実施例１−２では、音響信号列を入力として、学習処理によって、音響信号−状況生成モデル１２、状況−音響イベント生成モデル１３、および音響イベント−音響特徴量生成モデル１４を算出する。さらに、状況ラベル列１５を生成してもよいし、音響イベントラベル列１６を生成してもよい。ただし、状況／音響イベントモデル化部１０２が、音響信号−状況生成モデル１２や状況ラベル列１５や音響イベントラベル列１６を生成することは必須ではない。以降、同一のものには同じ参照符号を付し、説明は繰り返さない。 [Example 1-2]
In Example 1-2, the acoustic signal sequence is input, and the acoustic signal-situation generation model 12, the situation-acoustic event generation model 13, and the acoustic event-acoustic feature generation model 14 are calculated by learning processing. Furthermore, the situation label sequence 15 may be generated, or the acoustic event label sequence 16 may be generated. However, it is not essential for the situation / acoustic event modeling unit 102 to generate the acoustic signal-situation generation model 12, the situation label sequence 15, and the acoustic event label sequence 16. Hereinafter, the same reference numerals are given to the same components, and description thereof will not be repeated.

図２に例示するように、本形態のモデル処理装置１２０は、特徴量算出部１１１、音響特徴量列合成部１０１、状況／音響イベントモデル化部１０２、及び記憶部１０３を有する。モデル処理装置１２０は、例えば、公知又は専用のコンピュータに所定のプログラムが読み込まれることで構成される。 As illustrated in FIG. 2, the model processing apparatus 120 of this embodiment includes a feature amount calculation unit 111, an acoustic feature amount sequence synthesis unit 101, a situation / acoustic event modeling unit 102, and a storage unit 103. The model processing device 120 is configured, for example, by reading a predetermined program into a known or dedicated computer.

まず特徴量算出部１１１に音響信号列１０−１，・・・，１０−Ｓが入力される。各音響信号列１０−ｓ（ただし、ｓ∈｛１，・・・，Ｓ｝）は、短時間区間ごとに区分された要素からなり、各要素には要素番号が付されている。 First, acoustic signal sequences 10-1,..., 10-S are input to the feature amount calculation unit 111. Each acoustic signal sequence 10-s (where sε {1,..., S}) is composed of elements divided for each short time section, and each element is assigned an element number.

特徴量算出部１１１は、各音響信号列１０−ｓから、音響特徴量列１０−ｓを算出して出力する。各音響特徴量は複数個の要素からなるベクトルであってもよいし、単数の要素からなるスカラーであってもよい。例えば特徴量算出部１１１は、入力された音響信号列１０−ｓに対し、前述の短時間区間からなるフレームごとに、音圧レベル、音響パワー、ＭＦＣＣ特徴量、ＬＰＣ特徴量などを算出し、これらを音響特徴量列として出力する。さらに立ち上がり特性、調波性、時間周期性などの音響特徴量が音響特徴量列に加えられてもよい。各音響特徴量列１１−ｓには音響特徴量列番号ｓが付与される。 The feature amount calculation unit 111 calculates and outputs an acoustic feature amount sequence 10-s from each acoustic signal sequence 10-s. Each acoustic feature amount may be a vector composed of a plurality of elements, or a scalar composed of a single element. For example, the feature amount calculation unit 111 calculates a sound pressure level, an acoustic power, an MFCC feature amount, an LPC feature amount, and the like for each frame including the above-described short time interval for the input acoustic signal sequence 10-s. These are output as an acoustic feature quantity sequence. Furthermore, acoustic feature quantities such as rising characteristics, harmonicity, and time periodicity may be added to the acoustic feature quantity sequence. Each acoustic feature quantity column 11-s is assigned an acoustic feature quantity column number s.

立ち上がり特性とは、数十から数百ミリ秒ごとにおける、音響信号の大きさを表す指標の増加の度合いを表す指標である。ここで、音響信号の大きさを表す指標とは、例えば、音響信号の振幅の絶対値、音響信号の振幅の絶対値の対数値、音響信号のパワー又は音響信号のパワーの対数値である。例えば、以下の式で得られる値が０以上であればその値が立ち上がり特性とされ、以下の式で得られる値が０未満であれば０が立ち上がり特性とされる。

ただし、ｋ’はフレームをＫ’個の微小な時間区間（例えば１ｍｓｅｃ程度）に区分した場合の各時間区間に対応し、ｐ￣_ｋ’はｋ’番目の時間区間でのサンプルの大きさを表す指標の代表値又は平均値を表す。なお、「サンプルの大きさを表す指標」の例は、サンプルの振幅、サンプルの振幅の絶対値、サンプルの振幅の対数値、サンプルのエネルギー、サンプルのパワー、又はサンプルのパワーの対数値などである。「サンプル」は音響信号列の各音響信号を表す。また、Δｐ￣_ｋ’はｐ￣_ｋ’の変化率を表す。例えば、Δｐ⁻ _ｋ’＝ｐ⁻ _ｋ’−ｐ⁻ _ｋ’−１である。Δｐ⁻ _ｋ’＝ｐ⁻ _ｋ’＋１−ｐ⁻ _ｋ’としてもよい。また、最小二乗法等の近似手法を用いてｋ’番目の時間区間におけるｐ⁻ _ｋ’を近似した直線を求め、その時間区間におけるその直線の傾きをΔｐ⁻ _ｋ’としてもよい。また、ｋ’番目の時間区間を含む複数の時間区間におけるｐ￣_ｋ’−κ，・・・，ｐ￣_ｋ’−１，ｐ⁻ _ｋ’，ｐ￣_ｋ’＋１，．．．，ｐ￣_{ｋ’−κ’}の近時曲線を求め、そのｋ’番目の時間区間に対応する点での傾き（微分値）をΔｐ⁻ _ｋ’としてもよい。またχを任意の文字として、χの右肩の「−」は、χの上付きバーを意味する。また分子における（ｐ￣_ｋ’）^２を（ｐ￣_’）^ｍ’とし、ｍ’を任意の値としても良い。 The rising characteristic is an index representing the degree of increase in the index representing the magnitude of the acoustic signal every several tens to several hundreds of milliseconds. Here, the index representing the magnitude of the acoustic signal is, for example, an absolute value of the amplitude of the acoustic signal, a logarithmic value of the absolute value of the amplitude of the acoustic signal, a power of the acoustic signal, or a logarithmic value of the power of the acoustic signal. For example, if the value obtained by the following expression is 0 or more, the value is the rising characteristic, and if the value obtained by the following expression is less than 0, 0 is the rising characteristic.

Here, k′ corresponds to each time interval obtained when the frame is divided into K′ minute time intervals (for example, about 1 msec each), and p⁻_k′ represents a representative value or average value of the index representing the sample magnitude in the k′-th time interval. Examples of the "index representing the sample magnitude" are the sample amplitude, the absolute value of the sample amplitude, the logarithm of the sample amplitude, the sample energy, the sample power, and the logarithm of the sample power. A "sample" represents each acoustic signal in the acoustic signal sequence. Δp⁻_k′ represents the rate of change of p⁻_k′. For example, Δp⁻_k′ = p⁻_k′ − p⁻_{k′−1}. Alternatively, Δp⁻_k′ = p⁻_{k′+1} − p⁻_k′ may be used. Alternatively, an approximation method such as the least squares method may be used to obtain a straight line approximating p⁻_k′ in the k′-th time interval, and the slope of that line in the time interval may be used as Δp⁻_k′. Alternatively, an approximate curve of p⁻_{k′−κ}, ..., p⁻_{k′−1}, p⁻_k′, p⁻_{k′+1}, ..., p⁻_{k′−κ′} over a plurality of time intervals including the k′-th time interval may be obtained, and its slope (derivative) at the point corresponding to the k′-th time interval may be used as Δp⁻_k′. For an arbitrary character χ, "⁻" at the upper right of χ denotes χ with an overbar. Further, (p⁻_k′)² in the numerator may be replaced by (p⁻_k′)^{m′}, where m′ is an arbitrary value.

以下に調波性を例示する。

また、Ｎ”はフレームに含まれるサンプル数を表す１以上の整数、ｎ”はフレーム内の各サンプル点を表す１以上のＮ”以下の整数、ｘ（ｎ”）はサンプル点ｎ”でのサンプルの大きさを表す指標である。Ｒ_ｆｆ（τ”）はｆ（ｎ”）のラグτ”での自己相関係数、ｍａｘ｛・｝は「・」の最大値を表す。ラグτは１以上Ｎ以下の整数である。Ｒ_ｆｆ（τ”）は、例えば以下のように定義される。

The harmonic characteristics are exemplified below.

N″ is an integer of 1 or more representing the number of samples in the frame, n″ is an integer from 1 to N″ representing each sample point in the frame, and x(n″) is an index representing the sample magnitude at sample point n″. R_ff(τ″) is the autocorrelation coefficient of f(n″) at lag τ″, and max{·} denotes the maximum value of "·". The lag τ is an integer from 1 to N. R_ff(τ″) is defined, for example, as follows.

以下に時間周期性を例示する。

ただし、Ｌ”は一周期とみなすサンプル数、Ｍ”は時間周期性の度合を計算するための周期数を表す１以上の整数、ｐ”（・）はサンプルの大きさを表す指標を時間平滑化した値、ｐ￣はフレーム内でのサンプルの大きさを表す指標の平均値を表す。 The time periodicity is exemplified below.

Here, L″ is the number of samples regarded as one period, M″ is an integer of 1 or more representing the number of periods used to calculate the degree of temporal periodicity, p″(·) is a time-smoothed value of the index representing the sample magnitude, and p⁻ is the average value, within the frame, of the index representing the sample magnitude.

次に、音響特徴量列合成部１０１に、音響特徴量列１１−１，・・・，１１−Ｓ（ただし、Ｓは１以上の整数）が入力される。複数個の音響特徴量列１１−１，・・・，１１−Ｓが音響特徴量列合成部１０１に入力された場合、音響特徴量列合成部１０１は、それらを時系列方向（例えば、時系列順）につなぎ合わせ、それによって１つの音響特徴量列１１を得て出力する。音響特徴量列合成部１０１に１つの音響特徴量列１１−１のみが入力された場合、音響特徴量列合成部１０１はそれを音響特徴量列１１として出力する。音響特徴量列合成部１０１から出力された音響特徴量列１１は、状況／音響イベントモデル化部１０２に入力される。なお、音響特徴量列合成部１０１を経由することなく、１つ音響特徴量列１１がそのまま状況／音響イベントモデル化部１０２に入力されてもよい。或いは、音響特徴量列１１−１，・・・，１１−Ｓを生成した後に、それらを合成して音響特徴量列１１を得ることに代えて、音響信号列１０−１，・・・，１０−Ｓを時系列方向（例えば、時系列順）に合成した音響信号列１０を得た後に、音響信号列１０から音響特徴量列１１を生成してもよい。これ以降の処理は実施例１−１と同じであるため、説明を省略する。 Next, the acoustic feature quantity sequences 11-1, ..., 11-S (where S is an integer of 1 or more) are input to the acoustic feature quantity sequence synthesis unit 101. When a plurality of acoustic feature quantity sequences 11-1, ..., 11-S are input, the acoustic feature quantity sequence synthesis unit 101 joins them in the time-series direction (for example, in chronological order), thereby obtaining and outputting one acoustic feature quantity sequence 11. When only one acoustic feature quantity sequence 11-1 is input, the unit outputs it as the acoustic feature quantity sequence 11. The acoustic feature quantity sequence 11 output from the acoustic feature quantity sequence synthesis unit 101 is input to the situation / acoustic event modeling unit 102. Note that a single acoustic feature quantity sequence 11 may be input directly to the situation / acoustic event modeling unit 102 without passing through the acoustic feature quantity sequence synthesis unit 101. Alternatively, instead of generating the acoustic feature quantity sequences 11-1, ..., 11-S and then combining them into the acoustic feature quantity sequence 11, the acoustic signal sequences 10-1, ..., 10-S may first be combined in the time-series direction (for example, in chronological order) into an acoustic signal sequence 10, from which the acoustic feature quantity sequence 11 is then generated. Since the subsequent processing is the same as that of Example 1-1, description thereof is omitted.
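The joining performed by the acoustic feature quantity sequence synthesis unit 101 amounts to a concatenation, as in the sketch below. The function names are hypothetical, and the index mapping in `to_global_index` (n equals n′ plus the lengths of all preceding sequences) is an assumption about how the per-sequence index (s, n′) relates to the global index n; the corresponding relation is not reproduced in this text.

```python
def concatenate_sequences(sequences):
    """Join S feature sequences in order; also return each length N_s."""
    combined = []
    lengths = []
    for seq in sequences:
        combined.extend(seq)
        lengths.append(len(seq))
    return combined, lengths


def to_global_index(s, n_prime, lengths):
    """Map the 1-based pair (s, n') to a 1-based global index n
    (assumed mapping: offset by the lengths of sequences 1..s-1)."""
    return sum(lengths[:s - 1]) + n_prime
```

With a single input sequence, `concatenate_sequences` simply returns a copy of it, matching the S = 1 case described above.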

［実施例２−１］
実施例２−１では、実施例１−１で説明したように得られた状況−音響イベント生成モデル１３および音響イベント−音響特徴量生成モデル１４を用い、新たに入力された音響信号列から状況を推定する。 [Example 2-1]
In Example 2-1, the situation-acoustic event generation model 13 and the acoustic event-acoustic feature quantity generation model 14 obtained as described in Example 1-1 are used to estimate the situation from a newly input acoustic signal sequence.

図３に例示するように、本形態のモデル処理装置２１０は、記憶部２０３及び生成モデル比較部２０１を有する。生成モデル比較部２０１は、例えば、音響イベント推定部２０１ａおよび比較部２０１ｂを有する。モデル処理装置２１０は、例えば、公知又は専用のコンピュータに所定のプログラムが読み込まれることで構成される。また記憶部２０３には、実施例１−１で説明したように得られた状況−音響イベント生成モデル１３および音響イベント−音響特徴量生成モデル１４が格納されている。 As illustrated in FIG. 3, the model processing apparatus 210 according to this embodiment includes a storage unit 203 and a generated model comparison unit 201. The generation model comparison unit 201 includes, for example, an acoustic event estimation unit 201a and a comparison unit 201b. The model processing device 210 is configured, for example, by reading a predetermined program into a known or dedicated computer. The storage unit 203 stores the situation-acoustic event generation model 13 and the acoustic event-acoustic feature generation model 14 obtained as described in the example 1-1.

音響イベントの種類の総数Ｍ、状況の種類の総数Ｔ、音響特徴量列２１が生成モデル比較部２０１に入力される。音響特徴量列２１は、１個の音響特徴量または２個以上の音響特徴量を時系列方向（例えば、時系列順）につなぎ合わせた列である。実施例１−１で説明したように、各音響特徴量は、短時間区間ごとの音響信号から得られたものである。各音響特徴量は複数個の要素からなるベクトルであってもよいし、単数の要素からなるスカラーであってもよい。生成モデル比較部２０１は、例えば、入力された情報を用い、音響特徴量列２１と、状況−音響イベント生成モデル１３とを比較し、最も近いと判断された状況、若しくは近いと判断された状況から複数個、またはある尤度よりも高いと判断された状況を判定結果として出力する。また、生成モデル比較部２０１が、音響特徴量列２１と音響イベント−音響特徴量生成モデル１４とを用い、音響特徴量列２１に対応する音響イベント列を推定して出力してもよい。以下に、生成モデル比較部２０１の処理を例示する。 The total number M of acoustic event types, the total number T of situation types, and the acoustic feature quantity sequence 21 are input to the generation model comparison unit 201. The acoustic feature quantity sequence 21 is a sequence consisting of one acoustic feature quantity, or of two or more acoustic feature quantities joined in the time-series direction (for example, in chronological order). As described in Example 1-1, each acoustic feature quantity is obtained from the acoustic signal of a short time interval. Each acoustic feature quantity may be a vector composed of a plurality of elements or a scalar composed of a single element. Using the input information, the generation model comparison unit 201, for example, compares the acoustic feature quantity sequence 21 with the situation-acoustic event generation model 13 and outputs, as the determination result, the situation judged to be closest, a plurality of situations selected from those judged to be close, or the situations judged to have a likelihood higher than a certain value. The generation model comparison unit 201 may also use the acoustic feature quantity sequence 21 and the acoustic event-acoustic feature quantity generation model 14 to estimate and output the acoustic event sequence corresponding to the acoustic feature quantity sequence 21. The processing of the generation model comparison unit 201 is illustrated below.

まず、生成モデル比較部２０１の音響イベント推定部２０１ａは、記憶部２０３から読み込んだ音響イベント−音響特徴量生成モデル１４を用い、音響特徴量列２１を構成する各音響特徴量について確率Ｐ（音響特徴量｜音響イベント）を最大にする音響イベント列（音響イベント判定結果）を得て出力する。例えば、音響特徴量列２１の音響イベント推定部は、以下のように音響イベント列ｍ_１，・・・，ｍ_Ｎ’を得る。

ただし、ｆ_ｉは音響特徴量列２１に対応する時間区間の先頭からｉ番目（ｉ＝１，・・・，Ｎ’）の短時間区間に対応する音響特徴量を表し、音響特徴量列２１は音響特徴量ｆ_１，・・・，ｆ_Ｎ’の列である。ｍ_ｉは音響特徴量列２１に対応する時間区間の先頭からｉ番目の短時間区間に対応する音響イベントを表す。また、Ｎ’は正の整数であり、音響特徴量列２１に対応する時間区間が含む短時間区間の数を表す。Ｎ’＝Ｎであってもよいし、Ｎ’≠Ｎであってもよい。ｐ（ｆ_ｉ｜ｍ_ｉ，μ_ｍ，Λ_ｍ）は音響イベント−音響特徴量生成モデル１４から得られる。例えばｐ（ｆ_ｉ｜ｍ_ｉ，μ_ｍ，Λ_ｍ）はν_ｍ ^（ｈ）を平均、Σ_μｍ ^（ｈ）を分散、ｇ_μｍ ^（ｈ）を自由度とするＳｔｕｄｅｎｔ−ｔ分布に従う確率密度関数によって算出可能である。ｐ（ｍ_ｉ）は予め定められた事前確率である。また、音響イベント推定部２０１ａは、音響特徴量列２１を構成する各音響特徴量について確率Ｐ（音響特徴量｜音響イベント）が大きい方から選択された複数個の音響イベントからなる音響イベント列を音響イベント判定結果としてもよいし、当該確率Ｐ（音響特徴量｜音響イベント）が閾値以上（又は閾値を超える）１個または複数個の音響イベントからなる音響イベント列を音響イベント判定結果としてもよい。 First, the acoustic event estimation unit 201a of the generation model comparison unit 201 uses the acoustic event-acoustic feature quantity generation model 14 read from the storage unit 203 to obtain and output, for the acoustic feature quantities constituting the acoustic feature quantity sequence 21, the acoustic event sequence (acoustic event determination result) that maximizes the probability P(acoustic feature quantity | acoustic event). For example, the acoustic event estimation unit 201a obtains the acoustic event sequence m₁, ..., m_N′ for the acoustic feature quantity sequence 21 as follows.

Here, f_i represents the acoustic feature quantity corresponding to the i-th (i = 1, ..., N′) short time interval from the beginning of the time interval corresponding to the acoustic feature quantity sequence 21, and the acoustic feature quantity sequence 21 is the sequence of acoustic feature quantities f₁, ..., f_N′. m_i represents the acoustic event corresponding to the i-th short time interval from the beginning of that time interval. N′ is a positive integer representing the number of short time intervals included in the time interval corresponding to the acoustic feature quantity sequence 21. N′ = N or N′ ≠ N may hold. p(f_i | m_i, μ_m, Λ_m) is obtained from the acoustic event-acoustic feature quantity generation model 14. For example, p(f_i | m_i, μ_m, Λ_m) can be calculated with a probability density function following a Student-t distribution with ν_m^(h) as the mean, Σ_μm^(h) as the variance, and g_μm^(h) as the degrees of freedom. p(m_i) is a predetermined prior probability. The acoustic event estimation unit 201a may instead output, as the acoustic event determination result, an acoustic event sequence consisting of a plurality of acoustic events selected in descending order of the probability P(acoustic feature quantity | acoustic event) for each acoustic feature quantity constituting the acoustic feature quantity sequence 21, or an acoustic event sequence consisting of one or more acoustic events for which that probability is greater than or equal to (or exceeds) a threshold.
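The per-interval acoustic event estimation can be sketched as follows, under stated assumptions. The sketch handles only scalar acoustic feature quantities, uses a location-scale Student-t density whose `scale` parameter stands in for the variance-like parameter named in the text, and picks the event maximizing density times prior; the function names and the `(loc, scale, dof)` tuple layout are hypothetical and not taken from this specification.

```python
import math


def student_t_pdf(x, loc, scale, dof):
    """Density of a scalar location-scale Student-t distribution."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
                - 0.5 * math.log(dof * math.pi) - math.log(scale))
    return math.exp(log_norm - (dof + 1) / 2 * math.log1p(z * z / dof))


def estimate_events(features, event_models, priors=None):
    """For each feature f_i, return the 1-based event m maximizing
    p(f_i | m) * p(m). event_models[m-1] = (loc, scale, dof)."""
    M = len(event_models)
    if priors is None:
        priors = [1.0 / M] * M  # assumption: uniform prior p(m)
    labels = []
    for f in features:
        scores = [student_t_pdf(f, *event_models[m]) * priors[m]
                  for m in range(M)]
        labels.append(max(range(M), key=scores.__getitem__) + 1)
    return labels
```

Returning the top few events, or all events above a threshold, would follow the same pattern by sorting or filtering `scores` instead of taking the single maximum.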

生成モデル比較部２０１の比較部２０１ｂは、音響イベント推定部２０１ａで得られた音響イベント列ｍ_１，・・・，ｍ_Ｎ’から得られる音響イベントの分布と、状況−音響イベント生成モデル１３が表す音響イベントを確率変数としたＰ（音響イベント｜状況）の各状況に対応する分布とを比較し、これらの分布の距離に基づいて音響特徴量列２１に対応する状況または状況の列を推定し、その推定結果を状況判定結果として出力する。なお、音響イベントを確率変数としたＰ（音響イベント｜状況）の各状況に対応する分布は、状況ごとに定まる音響イベントを確率変数としたＰ（音響イベント｜状況）の分布である。例えば、これらの分布が最も近くなる状況を状況判定結果として出力してもよいし、これらの分布が近いほうから選択した複数個の状況を状況判定結果として出力してもよいし、これらの分布の距離が閾値以下（または未満）となる１個または複数個の状況を状況判定結果として出力してもよい。 The comparison unit 201b of the generation model comparison unit 201 includes the distribution of acoustic events obtained from the acoustic event sequence m ₁ ,..., _{M N ′} obtained by the acoustic event estimation unit 201a and the situation-acoustic event generation model 13. The distribution corresponding to each situation of P (acoustic event | situation) with the acoustic event represented as a random variable is compared, and the situation or situation column corresponding to the acoustic feature quantity column 21 is estimated based on the distance of these distributions. Then, the estimation result is output as a situation determination result. In addition, the distribution corresponding to each situation of P (acoustic event | situation) using the acoustic event as a random variable is a distribution of P (acoustic event | situation) using the acoustic event determined for each situation as a random variable. For example, the situation in which these distributions are closest may be output as the situation determination result, or a plurality of situations selected from the closest to these distributions may be output as the situation determination results. One or a plurality of situations in which the distance is equal to or less than (or less than) the threshold may be output as the situation determination result.

＜比較部２０１ｂの処理の具体例１＞
まず比較部２０１ｂが、入力された音響イベント列から、以下のように音響イベントの分布ｐ’（ｍ）（ただし、ｍ∈｛１，・・・，Ｍ｝）を算出する。

ただし、γ’は事前に設定された緩和パラメータ（例えば０．０１などの非負値）を表し、Ｃ_ｍは、入力された音響イベント列のうち音響イベントｍを表す音響イベントの個数を表す。 <Specific Example 1 of Processing of Comparison Unit 201b>
First, the comparison unit 201b calculates an acoustic event distribution p ′ (m) (where m∈ {1,..., M}) from the input acoustic event sequence as follows.

Here, γ′ represents a preset relaxation parameter (a non-negative value, for example 0.01), and C_m represents the number of acoustic events in the input acoustic event sequence that represent the acoustic event m.
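A smoothed event histogram of this kind can be sketched as below. The exact expression for p′(m) is not reproduced in this text, so the additive-smoothing form p′(m) = (C_m + γ′) / (N′ + M γ′) used here is an assumption chosen to be consistent with the definitions of γ′ and C_m; the function name is likewise hypothetical.

```python
from collections import Counter


def event_distribution(events, M, gamma_prime=0.01):
    """Smoothed distribution over events 1..M from a label sequence.

    Assumed form: p'(m) = (C_m + gamma') / (N' + M * gamma'), where C_m
    counts occurrences of event m and N' is the sequence length.
    Requires N' + M * gamma' > 0.
    """
    counts = Counter(events)
    total = len(events) + M * gamma_prime
    return [(counts.get(m, 0) + gamma_prime) / total
            for m in range(1, M + 1)]
```

With γ′ > 0 every event keeps a small non-zero probability, which avoids zero entries when the distribution is later compared with a divergence measure.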

次に比較部２０１ｂは、ｐ’（ｍ）と状況−音響イベント生成モデル１３を、下記に記すカルバックライブラー情報量（Kullback-Leibler divergence: KL divergence）やイェンセンシャノン情報量（Jensen-Shannon divergence: JS divergence）などの情報量基準に基づいて比較することで、入力された音響イベント列ｍ_１，・・・，ｍ_Ｎ’に対応する状況を推定する。

Next, the comparison unit 201b compares p′(m) with the situation-acoustic event generation model 13 on the basis of an information criterion such as the Kullback-Leibler divergence (KL divergence) or the Jensen-Shannon divergence (JS divergence) given below, thereby estimating the situation corresponding to the input acoustic event sequence m₁, ..., m_N′.

式（１５）又は（１６）の例の場合、比較部２０１ｂは、Ｐ（ｍ）にｐ’（ｍ）（ただし、ｍ＝１，・・・，Ｍ）を代入し、Ｑ_ｔ（ｍ）にＮ_ｔｍ（ただし、ｍ＝１，・・・，Ｍ，ｔ＝１，・・・，Ｔ）（音響イベントｍ＝１，・・・，Ｍを確率変数とした確率Ｐ（音響イベントｍ｜状況ｔ）の各状況ｔに対応する分布）を代入する。これにより、比較部２０１ｂは、各状況ｔ＝｛１，・・・，Ｔ｝に対応する情報量（合計Ｔ個の情報量）を得る。比較部２０１ｂは、各状況ｔ＝｛１，・・・，Ｔ｝について算出された情報量のうち、最も小さな情報量に対応する状況、または、最も小さな情報量から順番に選択した複数個の情報量に対応する複数個の状況、または、閾値以下（又は未満）の１個または複数個に対応する状況を、音響特徴量列２１に対応する状況（状況判定結果）として出力する。 In the case of equation (15) or (16), the comparison unit 201b substitutes p′(m) (where m = 1, ..., M) for P(m), and substitutes N_tm (where m = 1, ..., M and t = 1, ..., T), that is, the distribution corresponding to each situation t of the probability P(acoustic event m | situation t) with the acoustic event m = 1, ..., M as the random variable, for Q_t(m). The comparison unit 201b thereby obtains the information amount corresponding to each situation t = 1, ..., T (T information amounts in total). Among the information amounts calculated for the situations t = 1, ..., T, the comparison unit 201b outputs, as the situation corresponding to the acoustic feature quantity sequence 21 (situation determination result), the situation corresponding to the smallest information amount, the plurality of situations corresponding to a plurality of information amounts selected in ascending order from the smallest, or the one or more situations whose information amount is less than or equal to (or less than) a threshold.
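The divergence-based comparison can be illustrated with the sketch below, which uses the standard definitions of the KL and JS divergences. Several points are assumptions rather than statements of this specification: the direction D(P ‖ Q_t), the normalization of each row of N_tm so that it sums to 1 before comparison, the small floor `eps` guarding against division by zero, and all function names.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Standard KL divergence D(p || q); q is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Standard JS divergence: mean KL to the midpoint distribution."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def normalize(row):
    total = sum(row)
    return [x / total for x in row]


def closest_situation(p_obs, n_tm, divergence=js_divergence):
    """Return (1-based situation with the smallest divergence, scores).

    n_tm[t-1][m-1] holds N_tm; each row is normalized here (assumption).
    """
    scores = [divergence(p_obs, normalize(row)) for row in n_tm]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores
```

Selecting several situations, or all situations under a threshold, only requires sorting or filtering the returned `scores` instead of taking the minimum.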

<Specific Example 2 of Processing of Comparison Unit 201b>
The comparison unit 201b may instead compare the situation-acoustic event generation model 13 with the input acoustic event sequence as follows. In this method, the comparison unit 201b computes, for the input acoustic event sequence, the sum or product of the likelihoods of each situation under the situation-acoustic event generation model 13. The comparison unit 201b may output, as the situation determination result, the situation for which the sum or product of likelihoods is largest, several situations selected in descending order of the sum or product, or the one or more situations for which the sum or product is at or above (or strictly above) a threshold.

<<Example of the sum of the likelihoods of situation t under the situation-acoustic event generation model 13>>

Here, z_i denotes the situation corresponding to the i-th short time section from the beginning of the time interval corresponding to the acoustic feature quantity sequence 21, and m_i denotes the acoustic event corresponding to that i-th short time section.

<<Example of the product of the likelihoods of situation t under the situation-acoustic event generation model 13>>
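The sum and product criteria can be illustrated with the sketch below. It assumes that the likelihood of situation t for the i-th short time section is P(m_i | z_i = t), with the same candidate situation t applied to every section; this form is an assumption and may not match the expressions of this description exactly. The product is computed as a sum of logarithms to avoid numerical underflow. The model values and names are hypothetical.

```python
import math


def likelihood_sum(events, event_given_situation, t):
    """Sum over i of P(m_i | z_i = t)."""
    return sum(event_given_situation[t][m] for m in events)


def log_likelihood_product(events, event_given_situation, t):
    """Log of the product over i of P(m_i | z_i = t)."""
    return sum(math.log(event_given_situation[t][m]) for m in events)


def best_situation(events, event_given_situation, score):
    """Return the situation t whose score is largest."""
    return max(
        event_given_situation,
        key=lambda t: score(events, event_given_situation, t),
    )


# Hypothetical model: P(m | t) for T = 2 situations and M = 3 acoustic events.
model = {
    1: {1: 0.6, 2: 0.3, 3: 0.1},
    2: {1: 0.1, 2: 0.2, 3: 0.7},
}
events = [1, 1, 2, 3, 1]  # m_1, ..., m_5

by_sum = best_situation(events, model, likelihood_sum)
by_product = best_situation(events, model, log_likelihood_product)
```

Sorting the situations by score instead of taking the maximum would give the "several situations in descending order" and "at or above a threshold" variants.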

[Example 2-2]
In Example 2-2, the situation-acoustic event generation model 13 and the acoustic event-acoustic feature quantity generation model 14, obtained as described in Example 1-1, are used to estimate the situation from a newly input acoustic signal sequence.

As illustrated in FIG. 4, the model processing device 220 of this example includes a storage unit 203, a feature quantity calculation unit 211, and a generation model comparison unit 201. The model processing device 220 is configured, for example, by loading a predetermined program into a known or dedicated computer.

First, the acoustic signal sequence 20 is input to the feature quantity calculation unit 211. The acoustic signal sequence 20 consists of elements divided into short time sections, and each element is assigned an element number. The feature quantity calculation unit 211 calculates the acoustic feature quantity sequence 21 from the acoustic signal sequence 20 as described above and outputs it. The acoustic feature quantity sequence 21 is input to the generation model comparison unit 201. The subsequent processing is the same as in Example 2-1, so its description is omitted.
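The division into numbered short time sections can be pictured with the sketch below. The section length, the handling of trailing samples, and in particular the log-energy feature are placeholders chosen for illustration; they are not the feature quantities used by the feature quantity calculation unit 211, and all names are hypothetical.

```python
import math


def split_into_sections(samples, section_length):
    """Split a signal into consecutive short time sections, numbered from 1.

    Trailing samples that do not fill a whole section are dropped.
    """
    sections = []
    for start in range(0, len(samples) - section_length + 1, section_length):
        element_number = start // section_length + 1
        sections.append((element_number, samples[start:start + section_length]))
    return sections


def log_energy(section):
    """Placeholder feature: log of the mean squared amplitude of one section."""
    energy = sum(x * x for x in section) / len(section)
    return math.log(energy + 1e-12)  # small offset avoids log(0) on silence


# Hypothetical signal of 9 samples, split into sections of 4 samples.
signal = [0.0, 0.5, -0.5, 1.0, 0.25, -0.25, 0.1, 0.2, 0.3]
features = [
    (number, log_energy(section))
    for number, section in split_into_sections(signal, 4)
]
```

Each output pair keeps the element number alongside the feature value, mirroring the element numbers attached to the acoustic signal sequence.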

[Example 3-1]
Example 3-1 is a combination of Example 1-1 and Example 2-1.
In this example, the acoustic feature quantity sequences 11-1, ..., 11-S, 21 are taken as input, and a learning process calculates the situation-acoustic event generation model and the acoustic event-acoustic feature quantity generation model. The learning process may additionally generate the acoustic signal-situation generation model. The generated acoustic signal-situation generation model 12 and situation-acoustic event generation model 13 are then used to estimate the situation from the acoustic feature quantity sequence 21.

As illustrated in FIG. 5, the model processing device 310 of this example includes an acoustic feature quantity sequence synthesis unit 101, a situation/acoustic event modeling unit 102, a generation model comparison unit 201, and storage units 103 and 303. The model processing device 310 is configured, for example, by loading a predetermined program into a known or dedicated computer.

The acoustic feature quantity sequences 11-1, ..., 11-S, 21 are input to the acoustic feature quantity sequence synthesis unit 101, which, as in Example 1-1, combines them into the acoustic feature quantity sequence 11 and outputs it. The acoustic feature quantity sequence 11 is input to the situation/acoustic event modeling unit 102, which, as in Example 1-1, calculates the acoustic signal-situation generation model 12, the situation-acoustic event generation model 13, and the acoustic event-acoustic feature quantity generation model 14. The situation/acoustic event modeling unit 102 may also generate the situation label sequence 15, the acoustic event label sequence 16, or both. However, the situation/acoustic event modeling unit 102 is not required to generate the acoustic signal-situation generation model 12, the situation label sequence 15, or the acoustic event label sequence 16. The models and sequences generated by the situation/acoustic event modeling unit 102 are stored in the storage unit 103.

The acoustic feature quantity sequence 21 is also input to the generation model comparison unit 201. As in Example 2-1, the generation model comparison unit 201 compares the acoustic feature quantity sequence 21 with the situation-acoustic event generation model 13 and outputs, as the determination result, the situation judged to be closest, several situations taken in order from the closest, or the situations judged to have a likelihood higher than a given value. The generation model comparison unit 201 may also use the acoustic feature quantity sequence 21 and the acoustic event-acoustic feature quantity generation model 14 to estimate and output the acoustic event sequence corresponding to the acoustic feature quantity sequence 21.
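Estimating an acoustic event sequence from the acoustic feature quantity sequence 21 can be sketched as below. The sketch assumes that each acoustic event m generates feature quantities from a Gaussian with mean μ_m and a diagonal precision Λ_m, and that each section is assigned the event maximizing p(m) × N(f_i | μ_m, Λ_m⁻¹). The diagonal form and the per-section decision are simplifying assumptions for illustration; the parameter values and names are hypothetical.

```python
import math


def log_gaussian_diag(f, mean, precision):
    """Log density of a Gaussian with diagonal precision (inverse variance) entries."""
    return sum(
        0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * (x - mu) ** 2
        for x, mu, lam in zip(f, mean, precision)
    )


def estimate_events(features, means, precisions, priors):
    """For each f_i, pick the event m maximizing log p(m) + log N(f_i | mu_m, Lambda_m^-1)."""
    events = []
    for f in features:
        scores = {
            m: math.log(priors[m]) + log_gaussian_diag(f, means[m], precisions[m])
            for m in means
        }
        events.append(max(scores, key=scores.get))
    return events


# Hypothetical model with M = 2 acoustic events and 2-dimensional features.
means = {1: [0.0, 0.0], 2: [3.0, 3.0]}
precisions = {1: [1.0, 1.0], 2: [1.0, 1.0]}
priors = {1: 0.5, 2: 0.5}

features = [[0.1, -0.2], [2.9, 3.2], [0.4, 0.3]]  # f_1, f_2, f_3
event_sequence = estimate_events(features, means, precisions, priors)
```

The resulting event sequence is what the comparison step would then turn into an event distribution and compare against each situation.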

Either the processing of the generation model comparison unit 201 or the processing of the situation/acoustic event modeling unit 102 may be performed first. However, if the processing of the generation model comparison unit 201 is performed before that of the situation/acoustic event modeling unit 102, generation models obtained in advance must already be stored in the storage unit 103.

The acoustic feature quantity sequence 21 may also be input to the acoustic feature quantity sequence synthesis unit 101 together with a newly input acoustic feature quantity sequence. In this case, the acoustic feature quantity sequence synthesis unit 101 may join them in the time-series direction (for example, in time-series order) and send the result to the situation/acoustic event modeling unit 102.
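Joining sequences in time-series order might look like the following minimal sketch. The dictionary layout with a `start_time` field is a hypothetical representation chosen for the example, not something specified in this description.

```python
def concatenate_in_time_order(sequences):
    """Join feature sequences end to end, ordered by each sequence's start time."""
    ordered = sorted(sequences, key=lambda seq: seq["start_time"])
    merged = []
    for seq in ordered:
        merged.extend(seq["features"])
    return merged


# Hypothetical input: an existing sequence and a newly input one that started earlier.
sequences = [
    {"start_time": 10.0, "features": [[0.3], [0.4]]},
    {"start_time": 0.0, "features": [[0.1], [0.2]]},
]
combined = concatenate_in_time_order(sequences)
```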

[Example 3-2]
Example 3-2 is a combination of Example 1-2 and Example 2-2.
In this example, the acoustic signal sequences 10-1, ..., 10-S, 20 are taken as input, and a learning process calculates the situation-acoustic event generation model and the acoustic event-acoustic feature quantity generation model. The learning process may additionally generate the acoustic signal-situation generation model. The generated acoustic signal-situation generation model 12 and situation-acoustic event generation model 13 are then used to estimate the situation from the acoustic signal sequence 20.

As illustrated in FIG. 6, the model processing device 320 of this example includes feature quantity calculation units 111-1, ..., 111-S, 211 and the model processing device 310 described in Example 3-1 (FIG. 5).

The acoustic signal sequences 10-1, ..., 10-S, 20 are input to the feature quantity calculation units 111-1, ..., 111-S, 211, respectively. As described in Example 1-2, the feature quantity calculation units 111-1, ..., 111-S, 211 obtain the acoustic feature quantity sequences 11-1, ..., 11-S, 21 from the acoustic signal sequences 10-1, ..., 10-S, 20, respectively, and output them. The acoustic feature quantity sequences 11-1, ..., 11-S, 21 are stored in the storage unit 303 (FIG. 5). The subsequent processing is the same as in Example 3-1.

[Features of each example]
In each of the examples described above, when calculating a model of the relationship between acoustic feature quantities and situations or acoustic events, the acoustic signal-situation generation model 12, the situation-acoustic event generation model 13, the acoustic event-acoustic feature quantity model, and so on can be generated by a learning process that simultaneously considers the relationships between acoustic signals and situations, between situations and acoustic event sequences, and between acoustic event sequences and acoustic feature quantity sequences, which was difficult with the prior art. Considering the relationship between acoustic events and acoustic feature quantities together with the relationships between acoustic signals and situations and between situations and acoustic events allows the similarity between acoustic events to be reflected in the learning of the generation models and thus incorporated into them. As a result, the relationship between acoustic signals and situations can be modeled more accurately than with the prior art.

The present invention is not limited to the examples described above. For example, the generation model creation process and the situation/acoustic event determination process may be distributed across multiple devices, and the generation models and data stored in the storage units 103 and 303 may be distributed across multiple storage units. If the acoustic feature quantity sequence or the acoustic signal sequence is input in time-series order and processed sequentially, the sequence need not include the element numbers corresponding to the elements divided into short time sections. The various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capacity of the executing device or as needed. It goes without saying that other modifications can be made as appropriate without departing from the spirit of the present invention.

When the above configuration is implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer implements the above processing functions on that computer. The program describing this processing content can be recorded on a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media include magnetic recording devices, optical discs, magneto-optical recording media, and semiconductor memories.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium, such as a DVD or CD-ROM, on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers over a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own storage device and executes processing in accordance with the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing in accordance with it, or it may execute processing in accordance with the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results.

In the above embodiments, the processing functions of the device are realized by executing a predetermined program on a computer, but at least some of these processing functions may be realized in hardware.

Model processing devices 110, 120, 210, 220, 310, 320

Claims

A model processing device comprising a modeling unit that,
using at least an acoustic feature quantity sequence, which is a time series of acoustic feature quantities obtained from an acoustic signal sequence, the total number of types of acoustic events, and the total number of types of situations,
performs a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and
obtains at least a probability P(acoustic event | situation) that a situation generates an acoustic event and a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity.

The model processing device according to claim 1, wherein
the acoustic feature quantity sequence is a sequence of acoustic feature quantities f_1, ..., f_N, the total number of types of acoustic events is M, and the total number of types of situations is T, and
the modeling unit comprises,
with hyperparameters α, γ, μ_0, β_0, ν_0, B_0 and initial parameter values α_st^(0), N_st^(0), γ_tm^(0), N_tm^(0), N_m^(0), μ_m^(0), ν_m^(0), B_m^(0), U_sn′m^(0), g_μm^(0), Σ_μm^(0) set and h = 0,
and with Ψ being the digamma function,

in the case of

a first update unit that obtains
where D is an integer constant of 1 or more,

a second update unit that obtains

a third update unit that obtains
where (·)^T is the transpose of (·),

a fourth update unit that obtains
a determination unit that determines whether an end condition is satisfied and, when it determines that the end condition is not satisfied, causes the processing of the first to fourth update units to be re-executed with h + 1 as the new h, and
a model calculation unit that, when it is determined that the end condition is satisfied, obtains the probability P(acoustic event | situation) and the probability P(acoustic feature quantity | acoustic event) from the values obtained by any of the first to fourth update units and outputs them.

The model processing device according to claim 1 or 2,
further comprising an acoustic feature quantity calculation unit that obtains the acoustic feature quantity sequence from an input acoustic signal sequence and outputs it.

A model processing device that uses at least the total number of types of acoustic events, the total number of types of situations, the acoustic feature quantity sequence, and the probability P(acoustic feature quantity | acoustic event) and the probability P(acoustic event | situation) of any one of claims 1 to 3, the model processing device comprising:
an acoustic event estimation unit that obtains, for the acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event); and
a comparison unit that obtains a situation or a sequence of situations corresponding to the acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable.

The model processing device according to claim 4, wherein
the acoustic event estimation unit, with the acoustic feature quantity sequence being a sequence of acoustic feature quantities f_1, ..., f_N′, the total number of types of acoustic events being M, i = 1, ..., N′, N′ being a positive integer, p(m_i) being a predetermined prior probability, and μ_m and Λ_m being model parameters,

obtains m_1, ..., m_N′ given by the above as the acoustic event sequence.

The model processing device according to claim 4, wherein
the comparison unit, with M being the total number of types of acoustic events, T being the total number of types of situations, P(m) being the distribution of the acoustic events m = 1, ..., M, and Q_t(m) being the distribution corresponding to each situation t = 1, ..., T of the probability P(acoustic event | situation) = P(m | t) with the acoustic events m = 1, ..., M as the random variable, uses

or

to obtain a situation or a sequence of situations corresponding to the acoustic feature quantity sequence.

A model processing device that uses at least the total number of types of acoustic events, the total number of types of situations, a second acoustic feature quantity sequence, which is a sequence of some of the time-series acoustic feature quantities included in an acoustic feature quantity sequence that is a time series of acoustic feature quantities obtained from an acoustic signal sequence, a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity, and a probability P(acoustic event | situation) that a situation generates an acoustic event, the model processing device comprising:
an acoustic event estimation unit that obtains, for the second acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event);
a comparison unit that obtains a situation or a sequence of situations corresponding to the second acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable; and
a modeling unit that, using at least the acoustic feature quantity sequence, the total number of types of the acoustic events, and the total number of types of the situations, performs a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and obtains at least a second probability P(acoustic event | situation) that a situation generates an acoustic event and a second probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity.

A model processing device comprising:
a modeling unit that, using at least an acoustic feature quantity sequence, which is a time series of acoustic feature quantities obtained from an acoustic signal sequence, the total number of types of acoustic events, and the total number of types of situations, performs a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and obtains at least a probability P(acoustic event | situation) that a situation generates an acoustic event and a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity; and,
using at least the total number of types of the acoustic events, the total number of types of the situations, a second acoustic feature quantity sequence that is a sequence of some of the time-series acoustic feature quantities included in the acoustic feature quantity sequence, the probability P(acoustic feature quantity | acoustic event), and the probability P(acoustic event | situation),
an acoustic event estimation unit that obtains, for the second acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event); and
a comparison unit that obtains a situation or a sequence of situations corresponding to the second acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable.

The model processing device according to claim 7 or 8,
further comprising an acoustic feature quantity calculation unit that obtains at least one of the acoustic feature quantity sequence and the second acoustic feature quantity sequence from an input acoustic signal sequence and outputs it.

A model processing method performed by a model processing device, the method comprising,
using at least an acoustic feature quantity sequence, which is a time series of acoustic feature quantities obtained from an acoustic signal sequence, the total number of types of acoustic events, and the total number of types of situations,
performing a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and
obtaining at least a probability P(acoustic event | situation) that a situation generates an acoustic event and a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity.

A model processing method performed by a model processing device,
the method using at least the total number of types of acoustic events, the total number of types of situations, the acoustic feature quantity sequence, and the probability P(acoustic feature quantity | acoustic event) and the probability P(acoustic event | situation) of any one of claims 1 to 3, and comprising:
an acoustic event estimation step of obtaining, for the acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event); and
a comparison step of obtaining a situation or a sequence of situations corresponding to the acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable.

A model processing method performed by a model processing device,
the method using at least the total number of types of acoustic events, the total number of types of situations, a second acoustic feature quantity sequence, which is a sequence of some of the time-series acoustic feature quantities included in an acoustic feature quantity sequence that is a time series of acoustic feature quantities obtained from an acoustic signal sequence, a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity, and a probability P(acoustic event | situation) that a situation generates an acoustic event, and comprising:
an acoustic event estimation step of obtaining, for the second acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event);
a comparison step of obtaining a situation or a sequence of situations corresponding to the second acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable; and
a modeling step of, using at least the acoustic feature quantity sequence, the total number of types of the acoustic events, and the total number of types of the situations, performing a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and obtaining at least a second probability P(acoustic event | situation) that a situation generates an acoustic event and a second probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity.

A model processing method performed by a model processing device, the method comprising:
a modeling step of, using at least an acoustic feature quantity sequence, which is a time series of acoustic feature quantities obtained from an acoustic signal sequence, the total number of types of acoustic events, and the total number of types of situations, performing a learning process that searches for the maximum of the joint distribution corresponding to combinations of the acoustic events corresponding to situations, combinations of the situations corresponding to the acoustic signal sequence, and the acoustic feature quantities corresponding to the acoustic events, and obtaining at least a probability P(acoustic event | situation) that a situation generates an acoustic event and a probability P(acoustic feature quantity | acoustic event) that an acoustic event generates an acoustic feature quantity; and,
using at least the total number of types of the acoustic events, the total number of types of the situations, a second acoustic feature quantity sequence that is a sequence of some of the time-series acoustic feature quantities included in the acoustic feature quantity sequence, the probability P(acoustic feature quantity | acoustic event), and the probability P(acoustic event | situation),
an acoustic event estimation step of obtaining, for the second acoustic feature quantity sequence, the acoustic event sequence that maximizes the probability P(acoustic feature quantity | acoustic event); and
a comparison step of obtaining a situation or a sequence of situations corresponding to the second acoustic feature quantity sequence, based on the distance between the distribution of acoustic events obtained from the acoustic event sequence and the distribution corresponding to each situation of the probability P(acoustic event | situation) with the acoustic event as the random variable.

A program for causing a computer to function as the model processing device according to claim 1.