JP2019035888A

JP2019035888A - Sound signal generation device, sound signal generation method, and program

Info

Publication number: JP2019035888A
Application number: JP2017157920A
Authority: JP
Inventors: 卓也上村; Takuya Kamimura; 裕貴寺島; Yuki Terajima; 茂人古川; Shigehito Furukawa
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-08-18
Filing date: 2017-08-18
Publication date: 2019-03-07
Anticipated expiration: 2037-08-18
Also published as: JP6716512B2

Abstract

To provide a sound signal generation technique for generating, from white noise, a sound having an arbitrary texture of a general sound.

SOLUTION: The sound signal generation device includes a first sound feature extraction unit that extracts, from an input first sound, a first sound feature that is a time-independent feature quantity; a second sound feature extraction unit that extracts, from an input second sound, a second sound feature that is a time-dependent feature quantity; a white signal generation unit that initializes the generated sound to be output with white noise; and a signal deformation unit that outputs the generated sound deformed using the first sound feature and the second sound feature. The signal deformation unit includes a generated sound feature extraction unit that extracts, from the generated sound, a first generated sound feature that is a time-independent feature quantity and a second generated sound feature that is a time-dependent feature quantity, and an error evaluation signal deformation unit that deforms the generated sound so as to reduce a feature error calculated from a first feature error, which is the error between the first generated sound feature and the first sound feature, and a second feature error, which is the error between the second generated sound feature and the second sound feature.

SELECTED DRAWING: Figure 2

Description

The present invention relates to a technique for obtaining a sound signal that has the features of both of two input sound signals.

Research is under way on techniques for generating sounds with a desired texture. One example is voice quality conversion, as described in Non-Patent Document 1, which has been studied extensively.

As another example, Non-Patent Document 2 proposes a method for synthesizing sound textures. In this method, feature quantities representing the texture are extracted from an input sound, and a new sound with the same texture is synthesized from those feature quantities.

Non-Patent Document 1: T. Toda, A. W. Black, and K. Tokuda, “Spectral Conversion Based on Maximum Likelihood Estimation Considering Global Variance of Converted Parameter”, IEEE International Conference on Acoustics, Speech, and Signal Processing 2005 (ICASSP ’05) Proceedings, vol.1, no.1, pp.9-12, 2005.
Non-Patent Document 2: J. H. McDermott and E. P. Simoncelli, “Sound texture perception via statistics of the auditory periphery: Evidence from sound synthesis”, Neuron, vol.71, no.5, pp.926-940, 2011.

However, in the voice quality conversion technique of Non-Patent Document 1, the target texture is limited to the texture of an actually existing human voice. Non-Patent Document 2 deals with the texture of general sounds rather than voice quality, but it addresses texture synthesis, not texture conversion.

In other words, no method has existed for converting a general sound to an arbitrary texture, such as a texture other than voice quality (for example, the environment in which speech is uttered) or the texture of a non-speech sound.

An object of the present invention is therefore to provide a sound signal generation technique that generates, from white noise, a sound having an arbitrary texture of a general sound.

One aspect of the present invention includes: a first sound feature extraction unit that extracts, from an input first sound, a first sound feature that is a time-independent feature quantity; a second sound feature extraction unit that extracts, from an input second sound, a second sound feature that is a time-dependent feature quantity; a white signal generation unit that initializes a generated sound, which is to be the output, with white noise; and a signal deformation unit that outputs the generated sound deformed using the first sound feature and the second sound feature. The signal deformation unit includes: a generated sound feature extraction unit that extracts, from the generated sound, a first generated sound feature that is a time-independent feature quantity and a second generated sound feature that is a time-dependent feature quantity; and an error evaluation signal deformation unit that deforms the generated sound so as to reduce a feature error calculated from a first feature error, which is the error between the first generated sound feature and the first sound feature, and a second feature error, which is the error between the second generated sound feature and the second sound feature.

According to the present invention, a sound having an arbitrary texture of a general sound can be generated from white noise.

A table showing the parameters used in the sound signal generation algorithm.
A block diagram showing an example of the configuration of the sound signal generation device 100.
A flowchart showing an example of the operation of the sound signal generation device 100.

Hereinafter, embodiments of the present invention are described in detail. Components having the same function are given the same reference numeral, and redundant description is omitted.

Before the embodiments are described, the notation used in this specification is explained.

_ (underscore) denotes a subscript. For example, x^y_z means that y_z is a superscript of x, and x_{y_z} means that y_z is a subscript of x.

<Technical background>
Here, the sound signal generation algorithm used in the embodiments of the present invention is described. First, an overview is given.

(Overview of the sound signal generation algorithm)
(1) Input a sound whose texture is to be converted while its content is preserved (hereinafter, the second sound) and a sound having the target texture (hereinafter, the first sound).
(2) Extract from the first sound a feature quantity representing its texture (the first sound feature). The time-marginal statistics of the output waveforms of an auditory system model are used as the first sound feature. A time-marginal statistic is a statistic marginalized over the time dimension; for example, the mean, variance, skewness, and correlation coefficient are used. The time-marginal statistics are therefore time-independent feature quantities.
(3) Extract from the second sound a feature quantity representing its content (the second sound feature). The temporal patterns of the output waveforms of the auditory system model are used as the second sound feature. Here, the output waveforms themselves, namely the amplitude envelope and the band-divided amplitude envelope, are used as the temporal patterns. The temporal patterns are therefore time-dependent feature quantities.
(4) Initialize the generated sound with white noise.
(5) Deform the generated sound so that the feature quantity representing its texture (the first generated sound feature) approaches the first sound feature and the feature quantity representing its content (the second generated sound feature) approaches the second sound feature.
(6) Output the deformed generated sound.
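
Steps (4) to (6) amount to reducing a weighted feature error starting from noise. The following is a minimal sketch of that idea on a toy problem, not the algorithm of the embodiment: it works on a single sequence without any auditory system model, takes the sequence itself as the time-dependent feature and its mean and variance as the time-independent features, and uses plain gradient descent. The names `deform`, `alpha`, `lr`, and all target values are assumptions made for illustration.

```python
import math
import random


def mean_var(x):
    """Mean and population variance over time (time-independent features)."""
    m = sum(x) / len(x)
    v = sum((a - m) ** 2 for a in x) / len(x)
    return m, v


def feature_error(gen, con, tex_stats, alpha):
    """L = alpha * L_con + (1 - alpha) * L_tex for the toy features."""
    m, v = mean_var(gen)
    l_con = sum((g - c) ** 2 for g, c in zip(gen, con)) / len(gen)
    l_tex = (m - tex_stats[0]) ** 2 + (v - tex_stats[1]) ** 2
    return alpha * l_con + (1.0 - alpha) * l_tex


def deform(gen, con, tex_stats, alpha=0.5, lr=0.1, n_iter=200):
    """Move `gen` downhill on the feature error with plain gradient descent."""
    gen = list(gen)
    for _ in range(n_iter):
        m, v = mean_var(gen)
        dm, dv = m - tex_stats[0], v - tex_stats[1]
        # Gradient of L with respect to each sample, multiplied by len(gen)
        # so that the step size does not depend on the sequence length.
        gen = [
            g - lr * (alpha * 2.0 * (g - c)
                      + (1.0 - alpha) * (2.0 * dm + 4.0 * dv * (g - m)))
            for g, c in zip(gen, con)
        ]
    return gen


random.seed(0)
n = 200
# Toy "second sound" feature: a slowly varying temporal pattern.
con = [1.0 + 0.5 * math.sin(2.0 * math.pi * t / 50.0) for t in range(n)]
# Toy "first sound" feature: target mean and variance.
tex_stats = (2.0, 0.3)
noise = [random.gauss(0.0, 1.0) for _ in range(n)]   # step (4): white noise
result = deform(noise, con, tex_stats)               # step (5): deformation

initial_error = feature_error(noise, con, tex_stats, 0.5)
final_error = feature_error(result, con, tex_stats, 0.5)
```

With `alpha` closer to 1 the result follows `con` more closely; with `alpha` closer to 0 only the mean and variance targets are matched.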

Here, the auditory system model applies processing that simulates the human auditory system to an input sound. The model first divides the sound into a plurality of bands and extracts the amplitude envelope and fine structure of each band. Next, to simulate the nonlinearity of the response of the auditory system, it nonlinearly compresses the amplitude envelope. Finally, to reflect the rate of fluctuation of the amplitude envelope in each band, it computes waveforms obtained by band-dividing the amplitude envelope. That is, the auditory system model consists of the following four processes.
(a) Bandpass filter bank
(b) Envelope extraction
(c) Nonlinear compression
(d) Bandpass filter bank
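
For process (a), the specification states further on that the center frequencies are equally spaced on the ERB scale and that each filter decays to zero at the neighbouring center frequencies, with example values of 30 bands between 20 Hz and 10 kHz. The following is a minimal sketch under two assumptions that are not stated in the text: the widely used Glasberg and Moore approximation for erb(), and a half-cosine filter shape. All function names are hypothetical.

```python
import math


def erb(f_hz):
    """ERB-number of a frequency in Hz (assumed Glasberg-Moore approximation)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def erb_inv(e):
    """Inverse of erb(): frequency in Hz for a given ERB-number."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437


def center_frequencies(f_min, f_max, n_bands):
    """Center frequencies equally spaced on the ERB scale."""
    lo, hi = erb(f_min), erb(f_max)
    step = (hi - lo) / (n_bands - 1)
    return [erb_inv(lo + i * step) for i in range(n_bands)]


def half_cosine_gain(f_hz, i, centers):
    """Gain of band i at f_hz: 1 at its center, 0 at the neighbouring centers."""
    step = erb(centers[1]) - erb(centers[0])   # spacing in ERB units
    d = erb(f_hz) - erb(centers[i])
    if abs(d) >= step:
        return 0.0
    return math.cos(d / (2.0 * step) * math.pi)


# Example values from the specification: 20 Hz to 10 kHz, N_i = 30.
centers = center_frequencies(20.0, 10000.0, 30)
```

With this assumed shape, the squared gains of adjacent bands sum to one between their centers, which is one reason such a filter bank can be applied a second time during resynthesis.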

The auditory system model takes the amplitude values x(t) of a sound as input and outputs four waveforms: the waveform s(i,t) of each band, the amplitude envelope e(i,t) of s(i,t), the fine structure p(i,t) of s(i,t), and the waveform m(i,k,t) obtained by band-dividing the amplitude envelope e(i,t). The time-marginal statistics and temporal patterns in steps (2) and (3) above are computed from these output waveforms.
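
For a single band, the envelope and fine structure, the nonlinear compression, and their inversion can be sketched as follows. The sketch assumes NumPy, builds the analytic signal with an FFT (a common way to realize a Hilbert-transform step), uses the example value β = 0.3, and takes the real part when recombining envelope and fine structure. The helper name `analytic_signal` and the test tone are hypothetical.

```python
import numpy as np


def analytic_signal(s):
    """Analytic signal of a real 1-D array, computed in the frequency domain."""
    n = len(s)
    spectrum = np.fft.fft(s)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist once
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)


beta = 0.3                                # compression exponent (example value)
t = np.arange(2000) / 20000.0             # 0.1 s at f_x = 20 kHz
# Hypothetical band waveform: a 1 kHz tone with a 20 Hz amplitude modulation.
s = (1.0 + 0.5 * np.sin(2 * np.pi * 20 * t)) * np.sin(2 * np.pi * 1000 * t)

a = analytic_signal(s)
e = np.abs(a)                             # amplitude envelope e(t)
p = np.angle(a)                           # fine structure p(t)
e_compressed = e ** beta                  # nonlinear compression

e_restored = e_compressed ** (1.0 / beta)            # inverse of the compression
s_restored = np.real(e_restored * np.exp(1j * p))    # envelope times fine structure
```

Because the envelope is left unchanged between compression and inversion here, `s_restored` reproduces `s` up to rounding; in the algorithm the envelope is deformed in between, so the restored waveform differs from the original.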

Next, the sound signal generation algorithm is described.

(Sound signal generation algorithm)
Processing: Takes the first sound and the second sound as input and outputs a generated sound that combines the features of both.
Input: amplitude values x_tex(t) of the first sound, amplitude values x_con(t) of the second sound
Output: amplitude values x_gen(t) of the generated sound

Here, t is a parameter representing discrete time, with t ∈ {1, 2, …, T_{x_type}}, where type ∈ {tex, con, gen}. That is, T_{x_tex} is the length (number of samples) of the first sound, T_{x_con} is the length of the second sound, and T_{x_gen} is the length of the generated sound. In addition, T_{x_con} = T_{x_gen}.

The computation time of this algorithm increases linearly with T_{x_con}.
(1) Using the auditory system model, compute from the amplitude values x_tex(t) of the first sound the waveform s_tex(i,t) of each band, the amplitude envelope e_tex(i,t) of s_tex(i,t), and the waveform m_tex(i,k,t) obtained by band-dividing e_tex(i,t). The auditory system model is described later. Here, i and k are parameters representing band numbers, with i ∈ {1, 2, …, N_i} and k ∈ {1, 2, …, N_k}.
(2) Compute the time-marginal statistics f_{mean,tex}(i), f_{var,tex}(i), f_{skew,tex}(i), f_{cce,tex}(i,j), f_{pow,tex}(i,k), and f_{ccm,tex}(i,j,k) of the amplitude envelope e_tex(i,t) and the waveform m_tex(i,k,t). The method of computing each time-marginal statistic is described later.
(3) Compute the variance σ_{s_tex}(i)² of the waveform s_tex(i,t).

(4) Using the auditory system model, compute from the amplitude values x_con(t) of the second sound the amplitude envelope e_con(i,t) of the waveform s_con(i,t) and the waveform m_con(i,k,t) obtained by band-dividing e_con(i,t).
(5) Initialize the generated sound x_gen(t) with white noise; that is, x_gen(t) ← white noise.
(6) Repeat (6.1) to (6.11) N_iter times. N_iter is an integer of 1 or more; for example, 20 (see FIG. 1).
(6.1) Using the auditory system model, compute from the amplitude values x_gen(t) of the generated sound the waveform s_gen(i,t) of each band, the amplitude envelope e_gen(i,t) of s_gen(i,t), the fine structure p_gen(i,t) of s_gen(i,t), and the waveform m_gen(i,k,t) obtained by band-dividing e_gen(i,t).
(6.2) Compute the time-marginal statistics f_{mean,gen}(i), f_{var,gen}(i), f_{skew,gen}(i), f_{cce,gen}(i,j), f_{pow,gen}(i,k), and f_{ccm,gen}(i,j,k) of the amplitude envelope e_gen(i,t) and the waveform m_gen(i,k,t).
(6.3) Let the feature error be L(i) = αL_con(i) + (1-α)L_tex(i), and compute the second feature error L_con(i) and the first feature error L_tex(i) as follows. Here, α (0 ≦ α ≦ 1) is a parameter that determines the balance between the influences of the second sound and the first sound on the generated sound.

When α = 0, a sound that has the first sound feature but not the second sound feature is synthesized. When α = 1, noise-driven speech is synthesized.
(6.3.1) Let the second feature error L_con(i) be the squared error, given by the following equation, between the amplitude envelope e_con(i,t) and band-divided waveform m_con(i,k,t) of the second sound and the amplitude envelope e_gen(i,t) and band-divided waveform m_gen(i,k,t) of the generated sound.
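
The equation referred to in (6.3.1) is not reproduced in this text. A form consistent with the description, assuming unweighted and unnormalized sums over time and over the modulation bands k, would be:

```latex
L_{\mathrm{con}}(i) =
  \sum_{t} \bigl( e_{\mathrm{gen}}(i,t) - e_{\mathrm{con}}(i,t) \bigr)^{2}
  + \sum_{k=1}^{N_k} \sum_{t} \bigl( m_{\mathrm{gen}}(i,k,t) - m_{\mathrm{con}}(i,k,t) \bigr)^{2}
```

Any normalization or relative weighting of the two terms in the original equation cannot be recovered from the text.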

(6.3.2) Let the first feature error L_tex(i) be the squared error, given by the following equation, between the time-marginal statistics of the first sound and those of the generated sound.
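
The equation referred to in (6.3.2) is not reproduced in this text. A form consistent with the description, assuming an unweighted sum of squared differences over all statistics, would be:

```latex
L_{\mathrm{tex}}(i) =
  \sum_{\mathrm{stat}} \sum_{j,k}
  \bigl( f_{\mathrm{stat,gen}}(i,j,k) - f_{\mathrm{stat,tex}}(i,j,k) \bigr)^{2},
\qquad
\mathrm{stat} \in \{\mathrm{mean}, \mathrm{var}, \mathrm{skew}, \mathrm{cce}, \mathrm{pow}, \mathrm{ccm}\}
```

The arguments j and k are written for every statistic only for compactness; statistics such as the mean depend on i alone.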

Here, Σ_{j,k} is taken only for those stat that take j or k as arguments; specifically, for stat ∈ {cce, pow, ccm}.
(6.4) In descending order of the variance σ_{s_tex}(i)², deform the amplitude envelope e_gen(i,t) of each band i by the stochastic gradient method so that the feature error L(i) decreases. Let the deformed amplitude envelope be e'_gen(i,t).
(6.5) Apply to the amplitude envelope e'_gen(i,t) the inverse of the nonlinear compression in the auditory system model. That is, e'_gen(i,t) raised to the power 1/β is newly taken as e'_gen(i,t).

(6.6) Upsample the amplitude envelope e'_gen(i,t) from the sampling frequency f_e of the amplitude envelope to the sampling frequency f_x of the amplitude values. For example, f_x = 20 kHz and f_e = 400 Hz (see FIG. 1). The upsampled amplitude envelope is newly taken as e'_gen(i,t).
(6.7) Restore the waveform s'_gen(i,t) from the amplitude envelope e'_gen(i,t) and the fine structure p_gen(i,t) by the following equation.
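
The equation referred to in (6.7) is not reproduced in this text. Since the envelope and fine structure are the absolute value and argument obtained in the Hilbert-transform step, a consistent form would be the following, where taking the real part is an assumption:

```latex
s'_{\mathrm{gen}}(i,t) = \operatorname{Re}\!\left[ e'_{\mathrm{gen}}(i,t) \, \exp\bigl( j \, p_{\mathrm{gen}}(i,t) \bigr) \right]
```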

Here, j is the imaginary unit.
(6.8) Multiply the waveform s'_gen(i,t) by σ_{s_tex}(i)/σ_{s'_gen}(i), as in the following equation, so that the variance of s'_gen(i,t) equals the variance of the waveform s_tex(i,t). That is, s'_gen(i,t) multiplied by σ_{s_tex}(i)/σ_{s'_gen}(i) is newly taken as s'_gen(i,t).

Here, σ_{s'_gen}(i) is the standard deviation of the waveform s'_gen(i,t) and is given by the following equation.
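
The two equations referred to in (6.8) are not reproduced in this text. Consistent with the description, they can be written as follows; the symbol μ for the time average of the waveform and the division by the number of samples (rather than by one less) are assumptions:

```latex
s'_{\mathrm{gen}}(i,t) \leftarrow
  \frac{\sigma_{s_{\mathrm{tex}}}(i)}{\sigma_{s'_{\mathrm{gen}}}(i)} \, s'_{\mathrm{gen}}(i,t),
\qquad
\sigma_{s'_{\mathrm{gen}}}(i) =
  \sqrt{ \frac{1}{T_{x_{\mathrm{gen}}}} \sum_{t=1}^{T_{x_{\mathrm{gen}}}}
  \bigl( s'_{\mathrm{gen}}(i,t) - \mu_{s'_{\mathrm{gen}}}(i) \bigr)^{2} }
```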

(6.9) Apply the bandpass filter bank F_a(i) to the waveform s'_gen(i,t). The filtered waveform is newly taken as s'_gen(i,t). The bandpass filter bank F_a(i) is described later.
(6.10) Sum the waveforms s'_gen(i,t) according to the following equation to obtain the amplitude values x'_gen(t).
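
The equation referred to in (6.10) is not reproduced in this text. From the description it is presumably the sum over all N_i bands:

```latex
x'_{\mathrm{gen}}(t) = \sum_{i=1}^{N_i} s'_{\mathrm{gen}}(i,t)
```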

(6.11) Set x_gen(t) ← x'_gen(t).
(Auditory system model)
As described above, the auditory system model applies processing that simulates the human auditory system to an input sound.

In the following, for simplicity, x_type(t), s_type(i,t), e_type(i,t), p_type(i,t), and m_type(i,k,t) with type ∈ {tex, con, gen} are written x(t), s(i,t), e(i,t), p(i,t), and m(i,k,t), respectively.
Input: amplitude values x(t) of a sound
Output: the waveform s(i,t) of each band, the amplitude envelope e(i,t) of s(i,t), the fine structure p(i,t) of s(i,t), and the waveform m(i,k,t) obtained by band-dividing the amplitude envelope e(i,t)
(1) Divide the amplitude values x(t) into bands with the bandpass filter bank F_a(i). The center frequencies c_a(i) of the filters are equally spaced on the equivalent rectangular bandwidth (ERB) scale. The ERB scale simulates the arrangement of filters in the cochlea and is written here as erb(fω), where f is the sampling frequency and ω is the normalized frequency. The bandwidth of the i-th band is set so that the filter coefficients decay to 0 at c_a(i-1) and c_a(i+1).
(1.1) Let X(ω) be the discrete Fourier transform of the amplitude values x(t). ω is the normalized frequency, with ω ∈ {0, 1/T_x, 2/T_x, …, 1/2}.
(1.2) Multiply X(ω) by the filter coefficients F_a(i,ω). Let the result be S(i,ω).

Here, f_x is the sampling frequency of the amplitude values x(t), N_i is the number of bands of the bandpass filter bank F_a(i), ω_a0 is the center frequency of the lowest band of F_a(i) (minimum center frequency), and ω_a1 is the center frequency of the highest band of F_a(i) (maximum center frequency). For example, f_x = 20 kHz, N_i = 30, ω_a0 = 20/f_x, and ω_a1 = 10000/f_x (see FIG. 1).
(1.3) Let the waveform s(i,t) of each band be the inverse discrete Fourier transform of S(i,ω).
(2) Compute the amplitude envelope e(i,t) and the fine structure p(i,t) of the waveform s(i,t) by the Hilbert transform. That is, e(i,t) and p(i,t) are the absolute value and the argument, respectively, of the Hilbert transform of s(i,t).
(3) Nonlinearly compress the amplitude envelope e(i,t). That is, e(i,t) raised to the power β is newly taken as e(i,t).

Here, β (0 < β ≦ 1) is a parameter that determines the degree of compression. For example, β = 0.3 (see FIG. 1).
(4) Downsample the amplitude envelope e(i,t) from the sampling frequency f_x of the amplitude values to the sampling frequency f_e of the amplitude envelope. After downsampling, t ∈ {1, 2, …, T_e}, where T_e is the length (number of samples) of e(i,t) and T_e/f_e = T_x/f_x. The downsampled amplitude envelope is newly taken as e(i,t).
(5) Divide the amplitude envelope e(i,t) into bands with the bandpass filter bank F_m(k). This filter bank models the modulation filter bank thought to exist in the peripheral auditory system. The center frequencies of the filters are spaced on a log scale, and the bandwidths are set so that the sharpness (Q value) is 2.
(5.1) Let E(i,ω) be the discrete Fourier transform of the amplitude envelope e(i,t). ω is the normalized frequency, with ω ∈ {0, 1/T_e, 2/T_e, …, 1/2}.
(5.2) Multiply E(i,ω) by the filter coefficients F_m(k,ω). Let the result be M(i,k,ω).

Here, f_e is the sampling frequency of the amplitude envelope e(i,t), N_k is the number of bands of the bandpass filter bank F_m(k), ω_m0 is the center frequency of the lowest band of F_m(k) (minimum center frequency), and ω_m1 is the center frequency of the highest band of F_m(k) (maximum center frequency). For example, f_e = 400 Hz, N_k = 20, ω_m0 = 0.5/f_e, and ω_m1 = 200/f_e (see FIG. 1).
(5.3) Let m(i,k,t) be the inverse discrete Fourier transform of M(i,k,ω).
(Computation of time-marginal statistics)
The time-marginal statistics are computed from the output waveforms of the auditory system model.

In the following, for simplicity, s_type(i,t), e_type(i,t), and m_type(i,k,t) with type ∈ {tex, gen} are written s(i,t), e(i,t), and m(i,k,t), respectively. Likewise, the time-marginal statistics f_{mean,type}(i), f_{var,type}(i), f_{skew,type}(i), f_{cce,type}(i,j), f_{pow,type}(i,k), and f_{ccm,type}(i,j,k) are written f_mean(i), f_var(i), f_skew(i), f_cce(i,j), f_pow(i,k), and f_ccm(i,j,k), respectively.
Input: the amplitude envelope e(i,t) of the waveform s(i,t), and the waveform m(i,k,t) obtained by band-dividing e(i,t)
Output: the time-marginal statistics f_mean(i), f_var(i), f_skew(i), f_cce(i,j), f_pow(i,k), f_ccm(i,j,k)
(1) Compute the time-marginal statistics of the amplitude envelope e(i,t).
(1.1) Let f_mean(i) be the mean μ_e(i) of e(i,t).

(1.2) Let f_var(i) be the variance σ_e(i)² of e(i,t) divided by the square of the mean μ_e(i).

(1.3) Let f_skew(i) be the skewness of e(i,t).

(1.4) Let f_cce(i,j) be the correlation coefficient between e(i,t) and e(j,t).

(2) Compute the time-marginal statistics of the waveform m(i,k,t).
(2.1) Let f_pow(i,k) be the mean square of m(i,k,t) divided by the variance σ_m(i,k)². Here, μ_m(i,k) is the mean of m(i,k,t).

(2.2) Let f_ccm(i,j,k) be the correlation coefficient between m(i,k,t) and m(j,k,t).
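
The statistics in steps (1.1) to (2.2) can be sketched in plain Python as follows. The defining equations are not reproduced in this text, so the normalizations used here (dividing by the number of samples, the usual moment-based skewness) are assumptions, and the function names other than the f_ symbols are hypothetical.

```python
import math


def mean(x):
    return sum(x) / len(x)


def variance(x):
    """Population variance (dividing by the number of samples is an assumption)."""
    m = mean(x)
    return sum((a - m) ** 2 for a in x) / len(x)


def f_mean(e):
    """Step (1.1): mean of the envelope."""
    return mean(e)


def f_var(e):
    """Step (1.2): variance divided by the squared mean."""
    return variance(e) / mean(e) ** 2


def f_skew(e):
    """Step (1.3): third central moment over the cubed standard deviation."""
    m = mean(e)
    third = sum((a - m) ** 3 for a in e) / len(e)
    return third / variance(e) ** 1.5


def corr(a, b):
    """Correlation coefficient, used for f_cce(i,j) and f_ccm(i,j,k)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)
    return cov / math.sqrt(variance(a) * variance(b))


def f_pow(m_band):
    """Step (2.1) as worded: mean square divided by the variance."""
    return mean([a * a for a in m_band]) / variance(m_band)


# Hypothetical envelopes of two bands; e_j equals 2 * e_i + 1.
e_i = [1.0, 2.0, 3.0, 4.0]
e_j = [3.0, 5.0, 7.0, 9.0]
stats = (f_mean(e_i), f_var(e_i), f_skew(e_i), corr(e_i, e_j))
```

Each function reduces a sequence over time to a single number, which is what makes these statistics independent of time.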

FIG. 1 is a table listing the parameters used in the sound signal generation algorithm above. The values in the table are only examples.

(Modifications of the sound signal generation algorithm)
To generate sounds with a variety of complex textures with good performance, it is preferable to generate the sound signal after band division, as in the sound signal generation algorithm described above; however, a sound signal can also be generated without band division. Here, algorithms that simplify the sound signal generation algorithm are described.

First, simplified sound signal generation algorithm 1 is described. It differs from the sound signal generation algorithm in that the waveform m(i,k,t) obtained by band-dividing the amplitude envelope e(i,t) is not computed.
[Simplified sound signal generation algorithm 1]
(1) Using the auditory system model, compute from the amplitude values x_tex(t) of the first sound the waveform s_tex(i,t) of each band and the amplitude envelope e_tex(i,t) of s_tex(i,t).
(2) Compute the time-marginal statistics f_{mean,tex}(i), f_{var,tex}(i), f_{skew,tex}(i), and f_{cce,tex}(i,j) of the amplitude envelope e_tex(i,t).
(3) Compute the variance σ_{s_tex}(i)² of the waveform s_tex(i,t).
(4) Using the auditory system model, compute from the amplitude values x_con(t) of the second sound the amplitude envelope e_con(i,t) of the waveform s_con(i,t) of each band.
(5) Initialize the generated sound x_gen(t) with white noise; that is, x_gen(t) ← white noise.
(6) Repeat (6.1) to (6.11) N_iter times.
(6.1) Using the auditory system model, compute from the amplitude values x_gen(t) of the generated sound the waveform s_gen(i,t) of each band, the amplitude envelope e_gen(i,t) of s_gen(i,t), and the fine structure p_gen(i,t) of s_gen(i,t).
(6.2) Compute the time-marginal statistics f_{mean,gen}(i), f_{var,gen}(i), f_{skew,gen}(i), and f_{cce,gen}(i,j) of the amplitude envelope e_gen(i,t).
(6.3) Let the feature error be L(i) = αL_con(i) + (1-α)L_tex(i), and compute the second feature error L_con(i) and the first feature error L_tex(i) as follows.
(6.3.1) Let the second feature error L_con(i) be the squared error, given by the following equation, between the amplitude envelope e_con(i,t) of the second sound and the amplitude envelope e_gen(i,t) of the generated sound.

Here, Σ_j is executed only for a stat that takes j as an argument. Specifically, it is executed for stat ∈ {cce}.
(6.4) In descending order of the variance σ_{s_tex}(i)², that is, starting from the band i with the largest variance, transform the amplitude envelope e_gen(i,t) by the stochastic gradient method so that the feature error L(i) becomes smaller. Let the transformed amplitude envelope be e′_gen(i,t).
(6.5) Apply the inverse of the nonlinear compression in the auditory system model to the amplitude envelope e′_gen(i,t). That is, e′_gen(i,t) raised to the power 1/β is again denoted e′_gen(i,t).

(6.6) Upsample the amplitude envelope e′_gen(i,t) from the sampling frequency f_e of the amplitude envelope to the sampling frequency f_x of the amplitude values. The upsampled amplitude envelope is again denoted e′_gen(i,t).
(6.7) Restore the waveform s′_gen(i,t) from the amplitude envelope e′_gen(i,t) and the fine structure p_gen(i,t) by the following equation.

(6.8) Multiply the waveform s′_gen(i,t) by σ_{s_tex}(i)/σ_{s′_gen}(i), as in the following equation, so that the variance of the waveform s′_gen(i,t) becomes equal to the variance of the waveform s_tex(i,t). That is, s′_gen(i,t) multiplied by σ_{s_tex}(i)/σ_{s′_gen}(i) is again denoted s′_gen(i,t).

(6.9) Apply the bandpass filter bank F_a(i) to the waveform s′_gen(i,t). The waveform after application is again denoted s′_gen(i,t).
(6.10) Add the waveforms s′_gen(i,t) together according to the following equation to calculate the amplitude value x′_gen(t).

(6.11) Set x_gen(t) ← x′_gen(t).
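As an illustration only, steps (6.5), (6.8), and (6.10) can be sketched in Python as follows. The function names are hypothetical, the nonlinear compression is assumed to be a simple power law with exponent β, and the variance is assumed to be the population variance; none of these choices is stated in the text above.

```python
import statistics


def invert_compression(envelope, beta):
    """Step (6.5): undo an assumed power-law compression by raising to 1/beta."""
    return [value ** (1.0 / beta) for value in envelope]


def match_variance(waveform, sigma_target):
    """Step (6.8): scale one band waveform by sigma_target / sigma_waveform."""
    sigma = statistics.pstdev(waveform)  # population standard deviation (assumption)
    if sigma == 0.0:
        return list(waveform)  # a constant waveform cannot be rescaled
    return [value * sigma_target / sigma for value in waveform]


def sum_bands(band_waveforms):
    """Step (6.10): add the band waveforms sample by sample."""
    return [sum(samples) for samples in zip(*band_waveforms)]
```

Each function takes and returns plain lists, so the output of one step can be passed directly to the next within one iteration of step (6).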

Next, simplified sound signal generation algorithm 2 will be described. Simplified sound signal generation algorithm 2 differs from the sound signal generation algorithm in that, in addition to not calculating the waveform m(i,k,t) obtained by band-dividing the amplitude envelope e(i,t), it does not calculate the waveform s(i,t) for each band either.

[Simplified sound signal generation algorithm 2]
(1) From the amplitude value x_tex(t) of the first sound, calculate the amplitude envelope e_tex(t) of the amplitude value x_tex(t). That is, the amplitude envelope e_tex(t) is the absolute value of the Hilbert transform of the amplitude value x_tex(t).
(2) Calculate the time marginal statistics f_{mean,tex}, f_{var,tex}, f_{skew,tex} of the amplitude envelope e_tex(t).

(3) Calculate the variance σ_{x_tex}² of the amplitude value x_tex(t).

(4) From the amplitude value x_con(t) of the second sound, calculate the amplitude envelope e_con(t) of the amplitude value x_con(t). That is, the amplitude envelope e_con(t) is the absolute value of the Hilbert transform of the amplitude value x_con(t).
(5) Initialize the generated sound x_gen(t) with white noise. That is, x_gen(t) ← white noise.
(6) Repeat (6.1) to (6.11) N_iter times.
(6.1) From the amplitude value x_gen(t) of the generated sound, calculate the amplitude envelope e_gen(t) and the fine structure p_gen(t) of the amplitude value x_gen(t). That is, the amplitude envelope e_gen(t) and the fine structure p_gen(t) are the absolute value and the argument, respectively, of the Hilbert transform of the amplitude value x_gen(t).
(6.2) Calculate the time marginal statistics f_{mean,gen}, f_{var,gen}, f_{skew,gen} of the amplitude envelope e_gen(t).
(6.3) Let the feature error be L = αL_con + (1−α)L_tex, and calculate the second feature error L_con and the first feature error L_tex as follows.
(6.3.1) Let the second feature error L_con be the squared error between the amplitude envelope e_con(t) of the second sound and the amplitude envelope e_gen(t) of the generated sound, expressed by the following equation.

(6.3.2) Let the first feature error L_tex be the squared error between the time marginal statistics of the first sound and the time marginal statistics of the generated sound, expressed by the following equation.

(6.4) Transform the amplitude envelope e_gen(t) by the stochastic gradient method so that the feature error L becomes smaller. Let the transformed amplitude envelope be e′_gen(t).
(6.7) Restore the amplitude value x′_gen(t) from the amplitude envelope e′_gen(t) and the fine structure p_gen(t) by the following equation.

(6.8) Multiply the amplitude value x′_gen(t) by σ_{x_tex}/σ_{x′_gen}, as in the following equation, so that the variance of the amplitude value x′_gen(t) becomes equal to the variance of the amplitude value x_tex(t). That is, x′_gen(t) multiplied by σ_{x_tex}/σ_{x′_gen} is again denoted x′_gen(t).
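Steps (6.1) and (6.7) of simplified sound signal generation algorithm 2 can be sketched in Python as follows. This sketch rests on two assumptions that are not stated in the text: "the Hilbert transform" is read as the analytic signal of x(t), and the restoring equation of step (6.7) is taken to be x′(t) = e′(t)·cos p(t). The function names are hypothetical, and a direct O(N²) DFT is used only to keep the sketch free of external libraries.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a direct DFT (O(N^2))."""
    n = len(x)
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # Keep DC (and Nyquist for even n), double positive frequencies,
    # and zero the negative frequencies.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0
    return [sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def envelope_and_fine_structure(x):
    """Step (6.1): amplitude envelope (modulus) and fine structure (argument)."""
    z = analytic_signal(x)
    return [abs(v) for v in z], [cmath.phase(v) for v in z]


def restore(envelope, fine_structure):
    """Assumed form of step (6.7): x'(t) = e'(t) * cos(p(t))."""
    return [e * math.cos(p) for e, p in zip(envelope, fine_structure)]
```

With an unmodified envelope, restore returns the original signal, because the real part of the analytic signal is the signal itself; in the algorithm, the envelope is transformed in step (6.4) before restoration while the fine structure is kept.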

<First embodiment>
Hereinafter, the sound signal generation device 100 will be described with reference to FIGS. 2 and 3. FIG. 2 is a block diagram showing the configuration of the sound signal generation device 100. FIG. 3 is a flowchart showing the operation of the sound signal generation device 100. As shown in FIG. 2, the sound signal generation device 100 includes a first sound feature extraction unit 110, a second sound feature extraction unit 120, a white signal generation unit 130, a signal transformation unit 140, and a recording unit 190. The signal transformation unit 140 further includes a generated sound feature extraction unit 141 and an error evaluation signal transformation unit 142. The recording unit 190 is a component that records, as appropriate, information necessary for the processing of the sound signal generation device 100.

The operation of the sound signal generation device 100 will be described with reference to FIG. 3. The first sound feature extraction unit 110 extracts, from the input first sound, a first sound feature that is a feature quantity independent of time (S110). Specifically, first, by step (1) of the sound signal generation algorithm, the waveform s_tex(i,t) for each band, the amplitude envelope e_tex(i,t) of the waveform s_tex(i,t), and the waveform m_tex(i,k,t) obtained by band-dividing the amplitude envelope e_tex(i,t) are calculated from the amplitude value x_tex(t) of the first sound. Next, by step (2) of the sound signal generation algorithm, the time marginal statistics f_{mean,tex}(i), f_{var,tex}(i), f_{skew,tex}(i), f_{cce,tex}(i,j), f_{pow,tex}(i,k), f_{ccm,tex}(i,j,k) are calculated from the amplitude envelope e_tex(i,t) and the waveform m_tex(i,k,t). That is, the time marginal statistics f_{mean,tex}(i), f_{var,tex}(i), f_{skew,tex}(i), f_{cce,tex}(i,j), f_{pow,tex}(i,k), f_{ccm,tex}(i,j,k) constitute the first sound feature. In addition, by step (3) of the sound signal generation algorithm, the variance σ_{s_tex}(i)² is calculated from the waveform s_tex(i,t).

Note that it is not necessary to use all six time marginal statistics in calculating the first feature error L_tex(i) in step (6.3.2) of the sound signal generation algorithm. That is, the first feature error L_tex(i) may be calculated using at least one time marginal statistic. Therefore, it is not always necessary to calculate both the amplitude envelope e_tex(i,t) and the waveform m_tex(i,k,t), nor is it necessary to calculate all six time marginal statistics. In other words, in S110, only what is needed for calculating the first feature error L_tex(i) has to be calculated.
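The idea of computing the first feature error from only a subset of the time marginal statistics can be sketched in Python as follows. The sketch covers a single band and only the three envelope statistics (mean, variance divided by the squared mean, and skewness, following the definitions given in the claims). The function names and the dictionary keys are hypothetical, and the unweighted sum of squared differences is an assumed form of the error.

```python
def marginal_statistics(envelope):
    """Mean, normalized variance, and skewness of one amplitude envelope."""
    n = len(envelope)
    mean = sum(envelope) / n
    centered = [value - mean for value in envelope]
    variance = sum(c * c for c in centered) / n
    third_moment = sum(c ** 3 for c in centered) / n
    return {
        "mean": mean,
        "var": variance / (mean * mean) if mean != 0.0 else 0.0,
        "skew": third_moment / variance ** 1.5 if variance > 0.0 else 0.0,
    }


def texture_error(stats_tex, stats_gen, selected=("mean", "var", "skew")):
    """Squared error over the selected statistics only (assumed form of L_tex)."""
    return sum((stats_tex[name] - stats_gen[name]) ** 2 for name in selected)
```

Passing, for example, selected=("mean",) restricts the error to a single statistic, in which case the other statistics need not be computed at all in an actual implementation.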

The second sound feature extraction unit 120 extracts, from the input second sound, a second sound feature that is a feature quantity dependent on time (S120). Specifically, by step (4) of the sound signal generation algorithm, the amplitude envelope e_con(i,t) of the waveform s_con(i,t) and the waveform m_con(i,k,t) obtained by band-dividing the amplitude envelope e_con(i,t) are calculated from the amplitude value x_con(t) of the second sound. The amplitude envelope e_con(i,t) and the waveform m_con(i,k,t), which are time patterns, constitute the second sound feature.

Note that it is not necessary to use both the amplitude envelope e_con(i,t) and the waveform m_con(i,k,t) in calculating the second feature error L_con(i) in step (6.3.1) of the sound signal generation algorithm. That is, the second feature error L_con(i) may be calculated using either one of the amplitude envelope e_con(i,t) and the waveform m_con(i,k,t). Therefore, in some cases, it is not necessary to calculate the waveform m_con(i,k,t).

The white signal generation unit 130 initializes the generated sound, which is to be output, with white noise (S130). Specifically, this follows step (5) of the sound signal generation algorithm.

The signal transformation unit 140 outputs the generated sound transformed using the first sound feature and the second sound feature (S140). The generated sound feature extraction unit 141 extracts, from the generated sound, a first generated sound feature that is a feature quantity independent of time and a second generated sound feature that is a feature quantity dependent on time (S141). The first generated sound feature and the second generated sound feature are collectively referred to as generated sound features. The error evaluation signal transformation unit 142 transforms the generated sound so that a feature error becomes smaller, the feature error being calculated from a first feature error, which is the error between the first generated sound feature and the first sound feature, and a second feature error, which is the error between the second generated sound feature and the second sound feature (S142). When a predetermined condition is satisfied, the error evaluation signal transformation unit 142 outputs the transformed generated sound (S149). For example, the generated sound is output after the process of transforming it has been repeated N_iter times.
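The overall flow of S130, S141, S142, and S149 can be sketched in Python as follows. This is a toy under stated assumptions, not the algorithm of the embodiment: the time-independent feature is simply the mean of the signal, the time-dependent feature is the signal itself, and plain gradient descent with a finite-difference gradient stands in for the stochastic gradient method. All function and parameter names are hypothetical.

```python
import random


def feature_error(x, first_target, second_target, alpha):
    """L = alpha * L_con + (1 - alpha) * L_tex for two toy features."""
    first_feature = sum(x) / len(x)  # toy time-independent feature: the mean
    l_tex = (first_target - first_feature) ** 2
    # Toy time-dependent feature: the signal itself, compared sample by sample.
    l_con = sum((c - v) ** 2 for c, v in zip(second_target, x))
    return alpha * l_con + (1.0 - alpha) * l_tex


def numerical_gradient(x, error, eps=1e-6):
    """Central finite-difference gradient of error with respect to each sample."""
    grad = []
    for t in range(len(x)):
        up = x[:t] + [x[t] + eps] + x[t + 1:]
        down = x[:t] + [x[t] - eps] + x[t + 1:]
        grad.append((error(up) - error(down)) / (2.0 * eps))
    return grad


def generate(first_target, second_target, alpha=0.5, n_iter=50, step=0.1, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in second_target]  # S130: white-noise start

    def error(signal):
        return feature_error(signal, first_target, second_target, alpha)

    history = [error(x)]
    for _ in range(n_iter):  # repeat N_iter times
        grad = numerical_gradient(x, error)  # S141: evaluate features and error
        x = [v - step * g for v, g in zip(x, grad)]  # S142: reduce the error
        history.append(error(x))
    return x, history  # S149: output after N_iter repetitions
```

The returned history makes it possible to check that the feature error decreases over the iterations; the weight alpha trades off the time-dependent term against the time-independent term.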

The operations of the generated sound feature extraction unit 141 and the error evaluation signal transformation unit 142 will be described below. First, the operation of the generated sound feature extraction unit 141 will be described. To begin with, by step (6.1) of the sound signal generation algorithm, the waveform s_gen(i,t) for each band, the amplitude envelope e_gen(i,t) of the waveform s_gen(i,t), the fine structure p_gen(i,t) of the waveform s_gen(i,t), and the waveform m_gen(i,k,t) obtained by band-dividing the amplitude envelope e_gen(i,t) are calculated from the amplitude value x_gen(t) of the generated sound. The amplitude envelope e_gen(i,t) and the waveform m_gen(i,k,t), which are time patterns, constitute the second generated sound feature. Subsequently, by step (6.2) of the sound signal generation algorithm, the time marginal statistics f_{mean,gen}(i), f_{var,gen}(i), f_{skew,gen}(i), f_{cce,gen}(i,j), f_{pow,gen}(i,k), f_{ccm,gen}(i,j,k) of the amplitude envelope e_gen(i,t) and the waveform m_gen(i,k,t) are calculated. The time marginal statistics f_{mean,gen}(i), f_{var,gen}(i), f_{skew,gen}(i), f_{cce,gen}(i,j), f_{pow,gen}(i,k), f_{ccm,gen}(i,j,k) constitute the first generated sound feature.
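The cross-band statistic f_{cce,gen}(i,j) is described in the claims as a correlation coefficient between the amplitude envelopes of two bands. A minimal Python sketch, assuming the ordinary Pearson correlation coefficient and using hypothetical function names, is:

```python
import math


def correlation(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return covariance / (norm_a * norm_b)


def cross_band_correlations(envelopes):
    """f_cce(i, j) for every pair of bands i < j, keyed by (i, j)."""
    return {(i, j): correlation(envelopes[i], envelopes[j])
            for i in range(len(envelopes))
            for j in range(i + 1, len(envelopes))}
```

The same correlation function could be applied to m_gen(i,k,t) and m_gen(j,k,t) to obtain f_{ccm,gen}(i,j,k); constant envelopes, for which the denominator is zero, are not handled in this sketch.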

Note that only the generated sound features needed to correspond to what was calculated in S110 and S120 have to be calculated here.

Next, the operation of the error evaluation signal transformation unit 142 will be described. First, by steps (6.3) and (6.4) of the sound signal generation algorithm, the amplitude envelope e_gen(i,t) is transformed by the stochastic gradient method so that the feature error L(i) becomes smaller, proceeding through the band numbers i in descending order of the variance σ_{s_tex}(i)². As a result, the transformed amplitude envelope e′_gen(i,t) is obtained.
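The band processing order described above can be sketched in Python as follows. The function name is hypothetical, and ties between equal variances are resolved by band number, which is an assumption of the sketch.

```python
def band_order(variances):
    """Band numbers sorted so that the band with the largest variance comes first."""
    return sorted(range(len(variances)), key=lambda i: variances[i], reverse=True)
```

For example, band_order([0.2, 1.5, 0.7]) yields the order 1, 2, 0, so the envelope of band 1 would be transformed first.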

Note that, instead of calculating the first feature error L_tex(i) and the second feature error L_con(i) using the squared error, they may be calculated using the absolute value of the error. In addition, when calculating the first feature error L_tex(i) and the second feature error L_con(i), the individual sums of squares (or sums of absolute values) may be weighted before being added together.
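The two variations mentioned above, namely the absolute error in place of the squared error and per-term weights, can be sketched in Python as follows. The function name, the parameter names, and the default of unit weights are assumptions of this sketch.

```python
def weighted_error(targets, values, weights=None, absolute=False):
    """Weighted sum of squared (or absolute) differences between two sequences."""
    if weights is None:
        weights = [1.0] * len(targets)  # unweighted case
    if absolute:
        return sum(w * abs(a - b) for w, a, b in zip(weights, targets, values))
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, targets, values))
```

With targets [1.0, 2.0] and values [0.0, 4.0], the squared form gives 5.0, the absolute form gives 3.0, and the squared form with weights [2.0, 0.5] gives 4.0.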

Subsequently, by steps (6.5) to (6.11) of the sound signal generation algorithm, the amplitude value x_gen(t) of the generated sound is calculated from the transformed amplitude envelopes e′_gen(i,t) (i ∈ {1, 2, …, N_i}).

The present invention is not limited to the above-described embodiment. For example, the processing of S110 to S130 need not be executed in this order; the order may be changed as appropriate, or the steps may be executed in parallel.

According to the present invention, a sound having an arbitrary texture of general sounds can be generated from white noise.

<Modification>
In the sound signal generation device 100, each component has been described as operating based on the sound signal generation algorithm. However, each component may instead operate based on simplified sound signal generation algorithm 1 or simplified sound signal generation algorithm 2.

In this case, the first sound feature, the second sound feature, the first generated sound feature, and the second generated sound feature are not necessarily calculated for each band after band division. In this respect, the modification differs from the sound signal generation device 100, in which the first sound feature, the second sound feature, the first generated sound feature, and the second generated sound feature are calculated for each band obtained by band division in consideration of human auditory characteristics.

<Supplementary note>
The apparatus of the present invention has, for example, as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit; it may include a cache memory, registers, and the like), a RAM and a ROM as memory, an external storage device such as a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program necessary for realizing the above-described functions, the data necessary for the processing of the program, and the like (the storage location is not limited to the external storage device; for example, the program may be stored in a ROM, which is a read-only storage device). Data obtained by the processing of these programs is stored in the RAM, the external storage device, or the like as appropriate.

In the hardware entity, each program stored in the external storage device (or the ROM or the like) and the data necessary for the processing of each program are read into memory as necessary, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the above-described embodiment and can be modified as appropriate without departing from the spirit of the present invention. In addition, the processes described in the above embodiment need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus that executes them, or as necessary.

As described above, when the processing functions of the hardware entity (the apparatus of the present invention) described in the above embodiment are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program. By executing this program on the computer, the processing functions of the hardware entity are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only Memory), a CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program, for example, first temporarily stores the program recorded on the portable recording medium, or the program transferred from the server computer, in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute a process according to the program, or may sequentially execute a process according to the received program each time the program is transferred from the server computer to the computer. Alternatively, the above-described processing may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the hardware entity is configured by executing a predetermined program on a computer; however, at least part of these processing contents may be realized by hardware.

Claims

A sound signal generation device comprising:
a first sound feature extraction unit that extracts, from an input first sound, a first sound feature that is a feature quantity independent of time;
a second sound feature extraction unit that extracts, from an input second sound, a second sound feature that is a feature quantity dependent on time;
a white signal generation unit that initializes a generated sound, which is to be output, with white noise; and
a signal transformation unit that outputs the generated sound transformed using the first sound feature and the second sound feature,
wherein the signal transformation unit includes:
a generated sound feature extraction unit that extracts, from the generated sound, a first generated sound feature that is a feature quantity independent of time and a second generated sound feature that is a feature quantity dependent on time; and
an error evaluation signal transformation unit that transforms the generated sound so that a feature error becomes smaller, the feature error being calculated from a first feature error that is an error between the first generated sound feature and the first sound feature and a second feature error that is an error between the second generated sound feature and the second sound feature.

The sound signal generation device according to claim 1, wherein
the first sound feature is a time marginal statistic calculated from the first sound, and
the first generated sound feature is a time marginal statistic calculated from the generated sound.

The sound signal generation device according to claim 1 or 2, wherein
the second sound feature is a time pattern calculated from the second sound, and
the second generated sound feature is a time pattern calculated from the generated sound.

The sound signal generation device according to any one of claims 1 to 3, wherein
the first sound feature and the first generated sound feature are calculated for each band obtained by band division in consideration of human auditory characteristics.

The sound signal generation device according to claim 4, wherein,
where x_tex(t) is the amplitude value of the first sound, s_tex(i,t) is the waveform for each band calculated from x_tex(t), e_tex(i,t) is the amplitude envelope of s_tex(i,t), and m_tex(i,k,t) is the waveform obtained by band-dividing e_tex(i,t) (t being a parameter representing discrete time, and i and k being parameters representing band numbers), and
x_gen(t) is the amplitude value of the generated sound, s_gen(i,t) is the waveform for each band calculated from x_gen(t), e_gen(i,t) is the amplitude envelope of s_gen(i,t), and m_gen(i,k,t) is the waveform obtained by band-dividing e_gen(i,t) (t being a parameter representing discrete time, and i and k being parameters representing band numbers),
the first sound feature is f_{mean,tex}(i), which is the mean of e_tex(i,t); f_{var,tex}(i), which is the variance of e_tex(i,t) divided by the square of the mean of e_tex(i,t); f_{skew,tex}(i), which is the skewness of e_tex(i,t); f_{cce,tex}(i,j), which is the correlation coefficient between e_tex(i,t) and e_tex(j,t); f_{pow,tex}(i,k), which is the mean square of m_tex(i,k,t) divided by the variance of m_tex(i,k,t); and f_{ccm,tex}(i,j,k), which is the correlation coefficient between m_tex(i,k,t) and m_tex(j,k,t), and
the first generated sound feature is f_{mean,gen}(i), which is the mean of e_gen(i,t); f_{var,gen}(i), which is the variance of e_gen(i,t) divided by the square of the mean of e_gen(i,t); f_{skew,gen}(i), which is the skewness of e_gen(i,t); f_{cce,gen}(i,j), which is the correlation coefficient between e_gen(i,t) and e_gen(j,t); f_{pow,gen}(i,k), which is the mean square of m_gen(i,k,t) divided by the variance of m_gen(i,k,t); and f_{ccm,gen}(i,j,k), which is the correlation coefficient between m_gen(i,k,t) and m_gen(j,k,t).

The sound signal generation device according to any one of claims 1 to 5,
The sound signal generating device, wherein the second sound feature and the second generated sound feature are calculated for each band obtained by dividing a band in consideration of human auditory characteristics.

A sound signal generation method comprising:
a first sound feature extraction step in which a sound signal generation device extracts, from an input first sound, a first sound feature that is a feature quantity independent of time;
a second sound feature extraction step in which the sound signal generation device extracts, from an input second sound, a second sound feature that is a feature quantity dependent on time;
a white signal generation step in which the sound signal generation device initializes a generated sound, which is to be output, with white noise; and
a signal transformation step in which the sound signal generation device outputs the generated sound transformed using the first sound feature and the second sound feature,
wherein the signal transformation step includes:
a generated sound feature extraction step of extracting, from the generated sound, a first generated sound feature that is a feature quantity independent of time and a second generated sound feature that is a feature quantity dependent on time; and
an error evaluation signal transformation step of transforming the generated sound so that a feature error becomes smaller, the feature error being calculated from a first feature error that is an error between the first generated sound feature and the first sound feature and a second feature error that is an error between the second generated sound feature and the second sound feature.

A program for causing a computer to function as the sound signal generation device according to any one of claims 1 to 6.