JP2020038284A

JP2020038284A - Acoustic processing device, acoustic processing method and program

Info

Publication number: JP2020038284A
Application number: JP2018165175A
Authority: JP
Inventors: Zhao Feng Zhang (張 兆峰); Kazuhiro Nakadai (中臺 一博); Naoaki Sumita (住田 直亮); Hiroshi Nakajima (中島 弘史)
Original assignee: Honda Motor Co Ltd
Current assignee: Honda Motor Co Ltd
Priority date: 2018-09-04
Filing date: 2018-09-04
Publication date: 2020-03-12
Anticipated expiration: 2038-09-04
Also published as: JP7016307B2

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic processing device, an acoustic processing method, and a program that can easily obtain the sound collected in a situation in which the sound collection position changes. SOLUTION: A sound collection position discretization unit discretizes, at predetermined time intervals, a sound collection position, which is the position of a moving sound collection unit, and a simulation unit acquires an impulse response indicating the transfer characteristics from a sound source position to the sound collection position. The impulse response includes, at each point in time, N response coefficients (N: an integer larger than 1), from a 0th response coefficient to an (N-1)th response coefficient. The simulation unit performs a convolution operation using the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, together with the signal values from the signal value at the current time t to the signal value at time t-(N-1), where the signal values are obtained by discretizing, at the predetermined time intervals, an acoustic signal generated by a sound source. It thereby calculates a signal value at the current time t that indicates the acoustic signal at the sound collection position. SELECTED DRAWING: Figure 1

Description

The present invention relates to a sound processing device, a sound processing method, and a program.

Speech recognition is processing for identifying the content of uttered speech, and it is applied in various environments as an elemental technology of artificial intelligence (AI). Speech recognition generally uses an acoustic model that represents the statistical relationship between acoustic features, which represent the physical characteristics of speech, and pronunciation. Conventionally, acoustic models have been trained on the premise of a static environment in which the positional relationship between the speaker and the microphone is fixed.

Patent Document 1: JP 2008-219884 A

With the spread of AI, speech recognition is sometimes applied in dynamic environments. For example, a speech recognition engine may be mounted on a moving body such as a robot. In that case, the microphone for collecting speech is also installed on the moving body. Because the positional relationship between the speaker and the microphone changes, the acoustic features of the collected speech also change. Consequently, when speech recognition is performed in a dynamic environment, the recognition rate tends to drop if an acoustic model trained in a static environment is used as is.

When speech recognition is performed in a dynamic environment, using an acoustic model suited to that environment is expected to improve the recognition rate. Although the positional relationship between the speaker and the microphone can change from moment to moment, the method described in Patent Document 1 assumes that the microphone is stationary. On the other hand, acquiring training speech data while actually changing the position of the microphone, that is, the sound collection position, is cumbersome. It is therefore desirable to be able to easily obtain the speech data that would be collected, assuming an environment in which the sound collection position changes.

The present invention has been made in view of the above points, and an object thereof is to easily obtain the sound collected in a situation in which the sound collection position changes.

(1) The present invention has been made to solve the above problem, and one aspect of the present invention is a sound processing device including: a sound collection position discretization unit that discretizes, at predetermined time intervals, a sound collection position, which is the position of a moving sound collection unit; and a simulation unit that acquires an impulse response indicating the transfer characteristic from a sound source position to the sound collection position, the impulse response including, for each time, N response coefficients (N being an integer greater than 1) from a 0th response coefficient to an (N-1)th response coefficient. The simulation unit calculates a signal value at the current time t indicating the acoustic signal at the sound collection position by performing a convolution operation using (i) the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and (ii) among the signal values obtained by discretizing the acoustic signal emitted by a sound source at the predetermined time intervals, the signal values from the signal value at the current time t to the signal value at time t-(N-1).

(2) Another aspect of the present invention is the sound processing device described above, further including a sound source position discretization unit that discretizes the moving sound source position at predetermined time intervals, wherein the simulation unit acquires an impulse response indicating the transfer characteristic from the discretized sound source position to the sound collection position.

(3) Another aspect of the present invention is the sound processing device described above, wherein the simulation unit generates a simulation matrix of T+N-1 rows and T columns (T being an integer greater than N) that includes the response coefficients as element values. In the simulation matrix, each t-th row from the 0th row to the (N-2)th row includes, as the element values of its columns, the response coefficients from the t-th response coefficient to the 0th response coefficient, both based on the sound collection position at time t, followed by T-(t+1) zeros. Each t-th row from the (N-1)th row to the (T-1)th row includes t-N+1 zeros, the response coefficients from the (N-1)th response coefficient to the 0th response coefficient, both based on the sound collection position at time t, and T-(t+1) zeros. Each t-th row from the T-th row to the (T+N-2)th row includes t-N+1 zeros and the response coefficients from the (N-1)th response coefficient to the (t-T+1)th response coefficient, both based on the sound collection position at time t. The simulation unit also generates an acoustic signal vector that includes, as the element values of its rows, the signal values from the signal value at time 0 to the signal value at time T-1, and multiplies the acoustic signal vector by the simulation matrix.

(4) Another aspect of the present invention is a sound processing method in a sound processing device, the method including: a sound collection position discretization step in which the sound processing device discretizes, at predetermined time intervals, a sound collection position, which is the position of a moving sound collection unit; and a simulation step of acquiring an impulse response indicating the transfer characteristic from a sound source position to the sound collection position, the impulse response including, for each time, N response coefficients (N being an integer greater than 1) from a 0th response coefficient to an (N-1)th response coefficient. The simulation step calculates a signal value at the current time t indicating the acoustic signal at the sound collection position by performing a convolution operation using (i) the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and (ii) among the signal values obtained by discretizing the acoustic signal emitted by a sound source at the predetermined time intervals, the signal values from the signal value at the current time t to the signal value at time t-(N-1).

(5) Another aspect of the present invention is a program for causing a computer of a sound processing device to execute: a sound collection position discretization procedure of discretizing, at predetermined time intervals, a sound collection position, which is the position of a moving sound collection unit; and a simulation procedure of acquiring an impulse response indicating the transfer characteristic from a sound source position to the sound collection position, the impulse response including, for each time, N response coefficients (N being an integer greater than 1) from a 0th response coefficient to an (N-1)th response coefficient. The simulation procedure calculates a signal value at the current time t indicating the acoustic signal at the sound collection position by performing a convolution operation using (i) the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and (ii) among the signal values obtained by discretizing the acoustic signal emitted by a sound source at the predetermined time intervals, the signal values from the signal value at the current time t to the signal value at time t-(N-1).

According to aspects (1), (4), and (5) of the present invention, a synthesized signal that approximates the signal collected by a moving sound collection unit can be obtained easily.
According to aspect (2) of the present invention, a synthesized signal that approximates the signal collected in response to sound emitted from a moving sound source can be obtained easily.
According to aspect (3) of the present invention, the collected-signal vector is obtained by multiplying an acoustic signal vector based on the sound source signal by an impulse response matrix whose element values are the response coefficients of the impulse responses corresponding to the sound source position and the moving sound collection position. The signal values of the collected signal can therefore be obtained easily, without complicated operations.

FIG. 1 is a schematic block diagram illustrating a configuration example of a sound processing device according to a first embodiment.
FIG. 2 is an explanatory diagram for describing a simulation method according to the first embodiment.
FIG. 3 is a flowchart illustrating an example of a synthesized signal generation process according to the first embodiment.
FIG. 4 is a schematic block diagram illustrating a configuration example of a sound processing device according to a second embodiment.
FIG. 5 is an explanatory diagram for describing a simulation method according to the second embodiment.

(First embodiment)
Hereinafter, a first embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is a schematic block diagram illustrating a configuration example of a sound processing device 1 according to the present embodiment.
The sound processing device 1 includes a sound source signal acquisition unit 11, a sound collection position acquisition unit 12, a sound collection position discretization unit 13, a simulation unit 14, and a synthesized signal generation unit 15.

The sound source signal acquisition unit 11 acquires a sound source signal as the acoustic signal to be processed. The sound source signal is a digital acoustic signal consisting of a time series of signal values, one per time, sampled at the time interval corresponding to a predetermined sampling frequency (for example, 8 kHz to 48 kHz). The signal value at each sample time indicates the intensity of the sound at that time. The sound source signal acquisition unit 11 outputs the acquired sound source signal to the simulation unit 14.
The sound source signal acquisition unit 11 includes, for example, an analog-to-digital (AD) converter for converting an analog acoustic signal input from a microphone (not shown) into a digital acoustic signal. The microphone may be built into the sound processing device 1 or may be separate from it. The sound source signal acquisition unit 11 may also be an input/output interface for receiving an acoustic signal from another device separate from the sound processing device 1.
The sound source signal acquisition unit 11 may also read, from a storage unit (not shown) of the sound processing device 1, a data file that stores the acoustic signal specified by a command input to the unit. The command input to the sound source signal acquisition unit 11 may be a command input from another device or an instruction conveyed by an operation signal input from an operation unit (not shown).

The sound collection position acquisition unit 12 acquires a sound collection position signal indicating the sound collection position, which is one of the simulation conditions. The sound collection position is the virtual position of a sound collection unit (for example, a microphone) that collects sound. In general, the sound collection position can move, that is, vary over time. The sound collection position acquisition unit 12 generates, for example, a sound collection position signal indicating a predetermined movement pattern of the sound collection position. The sound collection position acquisition unit 12 may instead generate a sound collection position signal indicating the sound collection positions specified successively by an operation signal input from an operation unit (not shown). The sound collection position acquisition unit 12 outputs the generated sound collection position signal to the sound collection position discretization unit 13.

The sound collection position discretization unit 13 discretizes the sound collection position indicated by the sound collection position signal input from the sound collection position acquisition unit 12 by sampling it at the time interval corresponding to a predetermined sampling frequency. This sampling frequency is equal to the sampling frequency of the acoustic signal. The input sound collection position signal may be a digital signal indicating sound collection positions discretized at times that differ from the sampling times of the acoustic signal. In that case, the sound collection position discretization unit 13 interpolates the sound collection positions at the times indicated by the sound collection position signal and calculates the sound collection position at each time discretized at the time interval corresponding to that sampling frequency. The sound collection position discretization unit 13 outputs the discretized sound collection position signal to the simulation unit 14.
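The interpolation step can be illustrated with a minimal Python sketch. The text does not specify the interpolation method, so linear interpolation is an assumption here, as are the function name `resample_positions`, the 3-D tuple representation of positions, and the choice to hold the first and last known positions outside the covered time range.

```python
from bisect import bisect_right

def resample_positions(times, positions, fs, num_samples):
    """Interpolate known positions onto the audio sample times k / fs.

    times       -- increasing timestamps in seconds of the known positions
    positions   -- list of (x, y, z) tuples, one per timestamp
    fs          -- audio sampling frequency in Hz
    num_samples -- number of audio sample times to produce
    Linear interpolation is an assumption, not the method of the source text.
    """
    out = []
    for k in range(num_samples):
        t = k / fs
        if t <= times[0]:            # before the first known position: hold it
            out.append(tuple(positions[0]))
            continue
        if t >= times[-1]:           # after the last known position: hold it
            out.append(tuple(positions[-1]))
            continue
        i = bisect_right(times, t) - 1          # segment with times[i] <= t
        w = (t - times[i]) / (times[i + 1] - times[i])
        a, b = positions[i], positions[i + 1]
        out.append(tuple((1 - w) * p + w * q for p, q in zip(a, b)))
    return out

# A pickup moving from the origin to (1, 2, 0) in one second, sampled at 4 Hz.
q = resample_positions([0.0, 1.0], [(0, 0, 0), (1, 2, 0)], fs=4, num_samples=5)
```

Each entry of `q` then plays the role of q(t) at one audio sample time.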

The simulation unit 14 receives the sound source signal from the sound source signal acquisition unit 11 and the sound collection position signal from the sound collection position discretization unit 13.
The simulation unit 14 includes an impulse response acquisition unit 142. Model data representing a generation model of the impulse response, which indicates the transfer characteristic of sound from the sound source position to the sound collection position, is set in the impulse response acquisition unit 142 in advance. In the present embodiment, the sound source position is assumed to be stationary at a predetermined position.
Using the model data, the impulse response acquisition unit 142 generates, for each time, the impulse response from the sound source position to the sound collection position indicated by the sound collection position signal. Each impulse response consists of N response coefficients (N being an integer of 2 or more). In the following description, the individual response coefficients are referred to as the n-th response coefficient (n being an integer from 0 to N-1). The response period, which is the length of the impulse response, may be, for example, about the same as the reverberation time (for example, 0.1 s to 2.0 s) of the space containing the sound source position and the sound collection position to be simulated. Accordingly, the order N of the impulse response may be, for example, the integer obtained by rounding the real value that results from dividing the response period by the sampling interval.
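As a small worked example of choosing the order N, the following Python sketch divides a response period by the sampling interval and rounds the result. The function name `impulse_response_order` is hypothetical, and the lower clamp to 2 simply reflects the stated condition that N is an integer of 2 or more.

```python
def impulse_response_order(response_period_s, fs):
    """Return N: the response period divided by the sampling interval, rounded."""
    sampling_interval = 1.0 / fs          # seconds per sample
    n = round(response_period_s / sampling_interval)
    return max(2, n)                      # the text requires N >= 2

# With the ranges quoted in the text: a 0.1 s response period at 8 kHz and
# a 2.0 s response period at 48 kHz.
n_short = impulse_response_order(0.1, 8000)
n_long = impulse_response_order(2.0, 48000)
```

These two cases give 800 and 96000 coefficients respectively, which shows how quickly the per-sample cost of the convolution grows with reverberation time and sampling frequency.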

The simulation unit 14 performs a convolution operation on the sound source signal using the impulse responses generated for each discretized time t. Specifically, the simulation unit 14 multiplies each of the response coefficients, from the 0th response coefficient h'_{q(t)}(0) at the current time t to the (N-1)th response coefficient h'_{q(t-(N-1))}(N-1) at time t-(N-1), which is N-1 time steps before the current time t, by the corresponding signal value of the sound source signal, from the signal value s(t) at the current time t to the signal value s(t-(N-1)) at N-1 time steps before the current time t. It calculates the sum of the resulting products as the signal value x'(t) of the sound that can be collected at the sound collection position q(t) at the current time t. The simulation unit 14 outputs the calculated signal value to the synthesized signal generation unit 15.

The synthesized signal generation unit 15 generates a synthesized signal, which is the time series of the signal values input from the simulation unit 14, that is, the signal values calculated by the simulation. The synthesized signal generation unit 15 outputs the generated synthesized signal to another device, for example. The destination device is, for example, a speech recognition device or a loudspeaker (not shown). The speech recognition device includes an acoustic model training unit (not shown) and can generate an acoustic model using the synthesized signal input from the synthesized signal generation unit 15. The loudspeaker reproduces sound based on the synthesized signal input from the synthesized signal generation unit 15, thereby reproducing the sound that arrives at the moving sound collection position. Alternatively, the synthesized signal generation unit 15 may store the generated synthesized signal in a storage unit (not shown) of the sound processing device 1 instead of outputting it to another device.

(Simulation method)
Next, the simulation method according to the present embodiment will be described.
FIG. 2 illustrates the impulse response h'_{q(t)} at time t. The impulse response h'_{q(t)} indicates the transfer characteristic of sound from the sound source position p, which is the position of the sound source Sr, to the sound collection position q(t), which is the position of the sound collection unit Mc. The impulse response h'_{q(t)} at time t is expressed as the N-dimensional vector [h'_{q(t)}(0), h'_{q(t)}(1), h'_{q(t)}(2), ..., h'_{q(t)}(N-1)]^T, whose elements are the 0th-order response coefficient h'_{q(t)}(0) through the (N-1)th-order response coefficient h'_{q(t)}(N-1). Here, [...]^T denotes the transpose of the vector or matrix [...]. In the present application, the first row and the first column of a vector or matrix are referred to as the 0th row and the 0th column, respectively.

When both the sound source position p and the sound collection position q(t) are stationary, the signal value x'_{q(t)}(t) of the acoustic signal collected at the sound collection position q(t) at time t is calculated, as in conventional methods, by convolving the sound source signal s(t) with the impulse response h'_{q(t)}. The convolution operation can also be regarded as a mathematical model in which the τ-th order response coefficient h'_{q(t)}(τ) is the contribution rate of the signal value s(t-τ) of the sound source signal at the past time t-τ, which is τ samples before the current time t (τ being an integer from 1 to N-1), to the signal value x'_{q(t)}(t) at the current time t.

In the present embodiment, however, the impulse response h'_{q(t)} changes as the sound collection position q(t) changes with the passage of time t. In the convolution operation, the simulation unit 14 therefore uses the τ-th order response coefficient h'_{q(t-τ)}(τ) at time t-τ as the contribution rate of the signal value s(t-τ) of the sound source signal at the past time t-τ to the signal value x'_{q(t)}(t) at the current time t. In other words, the simulation unit 14 multiplies each of the response coefficients, from the 0th response coefficient h'_{q(t)}(0) at the current time t to the (N-1)th response coefficient h'_{q(t-(N-1))}(N-1) at time t-(N-1), which is N-1 time steps before the current time t, by the corresponding signal value of the sound source signal, from the signal value s(t) at the current time t to the signal value s(t-(N-1)) at N-1 time steps before the current time t, and calculates the sum of the resulting products as the signal value x'_{q(t)}(t) of the sound that can be collected at the sound collection position q(t) at the current time t.
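The sum described in this paragraph can be sketched in plain Python as follows. The function name `simulate_moving_pickup` and the list-of-lists layout of the impulse responses are illustrative assumptions, as is the treatment of samples before time 0, which are simply skipped.

```python
def simulate_moving_pickup(source, impulse_responses):
    """Convolve a source signal with time-varying impulse responses.

    source[t]               -- s(t), the source signal value at time t
    impulse_responses[t][n] -- h'_{q(t)}(n), n-th coefficient at time t
    Returns x with
        x[t] = sum over tau of h'_{q(t - tau)}(tau) * s(t - tau),
    so tap tau is taken from the impulse response generated at time t - tau.
    """
    num_samples = len(source)
    order = len(impulse_responses[0])            # N
    x = []
    for t in range(num_samples):
        acc = 0.0
        # Samples before time 0 do not exist, so stop at tau = t (assumption).
        for tau in range(min(order, t + 1)):
            acc += impulse_responses[t - tau][tau] * source[t - tau]
        x.append(acc)
    return x

# Stationary pickup: every impulse response is identical, so the result
# reduces to an ordinary convolution truncated to the source length.
static = simulate_moving_pickup([1.0, 2.0, 3.0], [[1.0, 0.5]] * 3)

# Moving pickup: the impulse response differs at every sample time.
moving = simulate_moving_pickup([1.0, 1.0, 1.0],
                                [[1.0, 0.5], [2.0, 0.25], [3.0, 0.125]])
```

When all impulse responses are equal the function behaves like a conventional convolution, which matches the stationary case described before this paragraph.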

As shown in Equation (1), the simulation unit 14 can calculate the synthesized signal vector x'_{q(t)} by multiplying the sound source signal vector s by the impulse response matrix H'_{q(t)}.
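Equation (1) itself is not reproduced in this text. From the sentence above and the dimensions stated in this section for the three quantities, it presumably takes the following form, offered here as a reconstruction rather than as the original equation:

```latex
\mathbf{x}'_{q(t)} = \mathbf{H}'_{q(t)}\,\mathbf{s},
\qquad
\mathbf{x}'_{q(t)} \in \mathbb{R}^{T+N-1},\quad
\mathbf{H}'_{q(t)} \in \mathbb{R}^{(T+N-1)\times T},\quad
\mathbf{s} \in \mathbb{R}^{T}
```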

The sound source signal vector s is expressed as [s(0), s(1), s(2), ..., s(t), ..., s(T-1)]^T. That is, s is a T-dimensional column vector whose t-th element is the signal value s(t) of the sound source signal at time t. T denotes the number of samples (the period) of the sound source signal to be processed and is an integer greater than the order N of the impulse response.
The synthesized signal vector x'_{q(t)} is expressed as [x'_{q(0)}(0), x'_{q(1)}(1), x'_{q(2)}(2), ..., x'_{q(t)}(t), ..., x'_{q(T+N-2)}(T+N-2)]^T. That is, x'_{q(t)} is a (T+N-1)-dimensional vector whose t-th element is the signal value x'_{q(t)}(t) of the synthesized signal at time t.
The impulse response matrix H'_{q(t)} is expressed as [h'_0, h'_1, h'_2, ..., h'_t, ..., h'_{T+N-2}]^T. That is, H'_{q(t)} is a matrix of T+N-1 rows and T columns whose t-th row is the T-dimensional element vector h'_t.

Each element vector h'_t is a T-dimensional row vector expressed by the following equation.
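The equation referred to here is not reproduced in this text. The row layout given in step S106 of the flowchart description suggests the following form, shown as a hedged reconstruction and not as the original equation:

```latex
\mathbf{h}'_t =
\begin{cases}
\bigl[\,h'_{q(t)}(t),\ \dots,\ h'_{q(t)}(0),\ \underbrace{0,\dots,0}_{T-(t+1)}\,\bigr],
  & 0 \le t \le N-2,\\[6pt]
\bigl[\,\underbrace{0,\dots,0}_{t-N+1},\ h'_{q(t)}(N-1),\ \dots,\ h'_{q(t)}(0),\ \underbrace{0,\dots,0}_{T-(t+1)}\,\bigr],
  & N-1 \le t \le T-1,\\[6pt]
\bigl[\,\underbrace{0,\dots,0}_{t-N+1},\ h'_{q(t)}(N-1),\ \dots,\ h'_{q(t)}(t-T+1)\,\bigr],
  & T \le t \le T+N-2.
\end{cases}
```

In every case the row has exactly T elements, which can be checked by adding the number of zeros to the number of response coefficients.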

The simulation unit 14 calculates the signal values of the synthesized signal by the procedure described below.
FIG. 3 is a flowchart illustrating an example of the synthesized signal generation process according to the present embodiment.
(Step S102) The simulation unit 14 constructs the sound source signal vector s by arranging, in order, the signal values of the sound source signal from the signal value s(0) at time 0 to the signal value s(T-1) at time T-1.
(Step S104) Using the preset model data, the impulse response acquisition unit 142 generates the impulse responses from the impulse response h'_{q(0)}, which corresponds to the sound collection position q(0) at time 0, to the impulse response h'_{q(T+N-2)}, which corresponds to the sound collection position q(T+N-2) at time T+N-2.

（ステップＳ１０６）シミュレーション部１４は、生成したインパルス応答ｈ’_ｑ（０）−ｈ’_{ｑ（Ｔ＋Ｎ−２）}からインパルス応答行列Ｈ’を構成する。インパルス応答行列Ｈ’を構成する際、第０行から第Ｎ−２行までの第ｔ行において、シミュレーション部１４は、時刻ｔにおける第ｔ応答係数ｈ’_ｑ（ｔ）（ｔ）から第０応答係数ｈ’_ｑ（ｔ）（０）までのｔ＋１個の応答係数と、Ｔ−（ｔ＋１）個の０（ゼロ；スカラ値）を各列の要素値として、その順序で配列する。
第Ｎ−１行から第Ｔ−１行までの第ｔ行において、シミュレーション部１４は、ｔ−Ｎ＋１個の０と、時刻ｔにおける第Ｎ−１応答係数ｈ’_ｑ（ｔ）（Ｎ−１）から第０応答係数ｈ’_ｑ（ｔ）（０）までのＮ個の応答係数と、Ｔ−（ｔ＋１）個の０を各列の要素値として、その順序で配列する。第Ｔ行から第Ｔ＋Ｎ−２行までの第ｔ行において、シミュレーション部１４は、ｔ−Ｎ＋１個の０と、時刻ｔにおける第Ｎ−１応答係数ｈ’_ｑ（ｔ）（Ｎ−１）から第ｔ−Ｔ＋１応答係数ｈ’_ｑ（ｔ）（ｔ−Ｔ＋１）までのＴ＋Ｎ−（ｔ＋１）個の応答係数を、各列の要素値として、その順序で配列する。
（ステップＳ１０８）シミュレーション部１４は、音源信号ベクトルｓにインパルス応答行列Ｈ’を乗算して合成信号ベクトルｘ’_ｑ（ｔ）を算出する。シミュレーション部１４は、合成信号ベクトルｘ’_ｑ（ｔ）の要素値ｘ’_ｑ（ｔ）（ｔ）を時刻ｔにおける合成信号の信号値として合成信号生成部１５に出力する。 (Step S106) The simulation unit 14 constructs an impulse response matrix H′ from the generated impulse responses h′_{q(0)} to h′_{q(T+N−2)}. In the t-th row, for rows 0 through N−2, the simulation unit 14 arranges, in this order as the element values of the columns, the t+1 response coefficients from the t-th response coefficient h′_{q(t)}(t) at time t down to the 0th response coefficient h′_{q(t)}(0), followed by T−(t+1) zeros (scalar values).
In the t-th row, for rows N−1 through T−1, the simulation unit 14 arranges, in this order, t−N+1 zeros, the N response coefficients from the (N−1)-th response coefficient h′_{q(t)}(N−1) at time t down to the 0th response coefficient h′_{q(t)}(0), and T−(t+1) zeros. In the t-th row, for rows T through T+N−2, the simulation unit 14 arranges, in this order, t−N+1 zeros followed by the T+N−(t+1) response coefficients from the (N−1)-th response coefficient h′_{q(t)}(N−1) at time t down to the (t−T+1)-th response coefficient h′_{q(t)}(t−T+1).
(Step S108) The simulation unit 14 calculates a synthesized signal vector x′_{q(t)} by multiplying the sound source signal vector s by the impulse response matrix H′. The simulation unit 14 outputs the element value x′_{q(t)}(t) of the synthesized signal vector x′_{q(t)} to the synthesized signal generation unit 15 as the signal value of the synthesized signal at time t.
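Steps S106 and S108 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the simulation unit 14: the function names and the list-of-lists matrix layout are assumptions introduced here, and `h_by_time[t]` is assumed to hold the N response coefficients of the impulse response for the sound collection position q(t), for t = 0 to T+N−2.

```python
def build_impulse_response_matrix(h_by_time, T):
    """Build the (T + N - 1) x T matrix H' described in step S106.

    h_by_time[t] is the list of N response coefficients for the sound
    collection position at time t (hypothetical input layout).
    """
    N = len(h_by_time[0])
    rows = []
    for t in range(T + N - 1):
        row = [0.0] * T
        for j in range(T):       # column j multiplies the source sample s(j)
            tap = t - j          # delay between emission time j and time t
            if 0 <= tap < N:
                row[j] = h_by_time[t][tap]
        rows.append(row)
    return rows


def synthesize(h_by_time, s):
    """Step S108: multiply H' by the sound source signal vector s."""
    H = build_impulse_response_matrix(h_by_time, len(s))
    return [sum(a * b for a, b in zip(row, s)) for row in H]
```

When every entry of `h_by_time` is the same impulse response, the result reduces to an ordinary convolution, which is a convenient sanity check for the row layout.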

（インパルス応答の生成モデル）
次に、インパルス応答の生成モデルの例について説明する。
インパルス応答の生成モデルとして、音源位置と収音位置（もしくは、音源位置を基準とする収音方向）に応じてインパルス応答を一意に定めることができる数理モデルであれば、いかなる生成モデルも利用可能である。
インパルス応答取得部１４２は、インパルス応答の生成モデルとして、例えば、幾何学的音響伝搬モデルを利用することができる。簡素な音響伝搬モデルのうちの一つとして球面波モデルが利用可能である。球面波モデルは、収音位置ｑにおける音圧が、音源位置から収音位置までの距離ｒに反比例して減衰し、音源位置における時刻から伝搬時間ｔ_ｐだけ遅延することを表すモデルである。伝搬時間ｔ_ｐは、距離ｒを音速ｖで除算して得られる。 (Impulse response generation model)
Next, an example of an impulse response generation model will be described.
As the generation model of the impulse response, any mathematical model can be used as long as it uniquely determines an impulse response from a sound source position and a sound pickup position (or a sound pickup direction relative to the sound source position).
The impulse response acquisition unit 142 can use, for example, a geometric acoustic propagation model as the generation model of the impulse response. A spherical wave model is one of the simplest acoustic propagation models that can be used. In the spherical wave model, the sound pressure at the sound pickup position q attenuates in inverse proportion to the distance r from the sound source position to the sound pickup position, and is delayed by the propagation time t_p relative to the time at the sound source position. The propagation time t_p is obtained by dividing the distance r by the speed of sound v.
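The spherical wave model described above can be sketched as a single-tap impulse response. This is only an illustration under stated assumptions: the function name is hypothetical, the delay is rounded to the nearest sample (the text does not say how fractional delays are handled), and the default sampling frequency of 8 kHz and speed of sound of 340 m/s are borrowed from the experimental conditions reported later in this description.

```python
import math


def spherical_wave_impulse_response(source, receiver, n_taps,
                                    fs=8000.0, v=340.0):
    """Return n_taps coefficients with one tap of amplitude 1/r at delay r/v."""
    r = math.dist(source, receiver)      # distance r between the two positions
    if r <= 0.0:
        raise ValueError("source and receiver must not coincide")
    delay = round(r / v * fs)            # propagation time t_p in samples (assumed rounding)
    h = [0.0] * n_taps
    if delay < n_taps:                   # taps beyond the response length are dropped
        h[delay] = 1.0 / r               # attenuation inversely proportional to r
    return h
```

For a sound source 3.4 m from the sound pickup position, the propagation time is 0.01 s, so the single non-zero coefficient falls at tap 80 with amplitude 1/3.4.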

また、インパルス応答取得部１４２は、予め複数の受音点のそれぞれに対して実測された伝達関数を補間して、収音位置における伝達関数を算出してもよい。周波数領域で算出される伝達関数に対して逆フーリエ変換を行うことにより、時間領域のインパルス応答が得られる。複数の伝達関数を補間する手法として、ＦＤＬＩ（ＦｒｅｑｕｅｎｃｙＤｏｍａｉｎＬｉｎｅａｒｏｒｂｉ−ｌｉｎｅａｒＩｎｔｅｒｐｏｌａｔｉｏｎ）法、ＴＤＬＩ（ＴｉｍｅＤｏｍａｉｎＬｉｎｅａｒｉｎｔｅｒｐｏｌａｔｉｏｎ）法、ＦＴＤＬＩ（ＦｒｅｑｕｅｎｃｙＴｉｍｅＤｏｍａｉｎＬｉｎｅａｒｏｒｂｉ−ｌｉｎｅａｒＩｎｔｅｒｐｏｌａｔｉｏｎ）法などのいずれの手法が用いられてもよい。ＦＤＬＩ法とは、２以上の受音点間において、それぞれの受音点に対する伝達関数を周波数領域で線形補間して、収音位置に対する伝達関数を算出する手法である。ＴＤＬＩ法とは、時間領域で２以上の受音点間において、それぞれの受音点に対する伝達関数を時間領域で線形補間して、収音位置に対する伝達関数を算出する手法である。ＦＴＤＬＩ法は、時間領域で２以上の受音点間において、それぞれの受音点に対する伝達関数の位相を周波数領域で線形補間し、振幅を時間領域で線形補間する手法である。 Further, the impulse response acquisition unit 142 may calculate the transfer function at the sound pickup position by interpolating transfer functions measured in advance for each of a plurality of sound receiving points. An impulse response in the time domain is obtained by applying an inverse Fourier transform to the transfer function calculated in the frequency domain. As the method of interpolating a plurality of transfer functions, any of the FDLI (Frequency Domain Linear or bi-linear Interpolation) method, the TDLI (Time Domain Linear Interpolation) method, the FTDLI (Frequency Time Domain Linear or bi-linear Interpolation) method, and the like may be used. The FDLI method linearly interpolates, in the frequency domain, the transfer functions for two or more sound receiving points to calculate the transfer function for the sound pickup position. The TDLI method linearly interpolates, in the time domain, the transfer functions for two or more sound receiving points to calculate the transfer function for the sound pickup position. The FTDLI method linearly interpolates the phase of the transfer functions for two or more sound receiving points in the frequency domain and linearly interpolates their amplitude in the time domain.
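As a rough illustration of the time-domain interpolation described above for the TDLI method, the following sketch blends two measured impulse responses with a linear weight. The function name and the `weight` parameter are assumptions made for this example; how the weight is derived from the sound pickup position relative to the two sound receiving points is not specified here and is left to the caller.

```python
def interpolate_time_domain(h_a, h_b, weight):
    """Linearly interpolate two equal-length impulse responses.

    weight = 0 returns h_a, weight = 1 returns h_b (hypothetical convention).
    """
    if len(h_a) != len(h_b):
        raise ValueError("impulse responses must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * a + weight * b for a, b in zip(h_a, h_b)]
```

The FDLI and FTDLI variants would apply the same kind of weighting to frequency-domain quantities (the complex transfer function, or its phase) before transforming back to the time domain.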

また、インパルス応答取得部１４２は、インパルス応答の生成モデルとして、音源位置から放射される音波の伝搬を表す波動方程式から導出されたモデルを用いてもよい。波動方程式から導出されるグリーン関数は、音源位置から収音位置までの伝達特性を示すインパルス応答として利用することができる。Ｈａｂｅｔｓが提案した室内インパルス応答生成法では、直方体の形状を有する室の壁面における音の反射特性を境界条件として導出されるグリーン関数がインパルス応答として採用されている。Ｈａｂｅｔｓが提案した手法については、例えば、次の文献に詳しく記載されている。
Ｈａｂｅｔｓ，Ｅ．Ａ．（２００６）．Ｒｏｏｍｉｍｐｕｌｓｅｒｅｓｐｏｎｓｅｇｅｎｅｒａｔｏｒ．ＴｅｃｈｎｉｓｃｈｅＵｎｉｖｅｒｓｉｔｅｉｔＥｉｎｄｈｏｖｅｎ，Ｔｅｃｈ．Ｒｅｐ．２（２．４），１． Further, the impulse response acquisition unit 142 may use, as the generation model of the impulse response, a model derived from the wave equation representing the propagation of a sound wave radiated from the sound source position. The Green's function derived from the wave equation can be used as an impulse response indicating the transfer characteristic from the sound source position to the sound pickup position. In the room impulse response generation method proposed by Habets, a Green's function derived using the sound reflection characteristics of the walls of a rectangular parallelepiped room as boundary conditions is adopted as the impulse response. The method proposed by Habets is described in detail in, for example, the following document.
Habets, E. A. (2006). Room impulse response generator. Technische Universiteit Eindhoven, Tech. Rep. 2 (2.4), 1.

（第２の実施形態）
次に、本発明の第２の実施形態について説明する。以下の説明では、第１の実施形態との差異点を主とする。第１の実施形態と共通の処理、構成については、同一の符号を付して、特に断らない限りその説明を援用する。
図４は、本実施形態に係る音響処理装置１の構成を示す概略図である。
音響処理装置１は、音源信号取得部１１、収音位置取得部１２、収音位置離散化部１３、シミュレーション部１４、合成信号生成部１５、音源位置取得部１６および音源位置離散化部１７を含んで構成される。即ち、図４に示す音響処理装置１は、図１に示す音響処理装置１に対して、さらに音源位置取得部１６と音源位置離散化部１７を備える。 (Second embodiment)
Next, a second embodiment of the present invention will be described. The following description focuses on the differences from the first embodiment. Processes and configurations common to the first embodiment are denoted by the same reference numerals, and their description is incorporated here unless otherwise noted.
FIG. 4 is a schematic diagram illustrating a configuration of the sound processing device 1 according to the present embodiment.
The sound processing device 1 includes a sound source signal acquisition unit 11, a sound collection position acquisition unit 12, a sound collection position discretization unit 13, a simulation unit 14, a synthesized signal generation unit 15, a sound source position acquisition unit 16, and a sound source position discretization unit 17. That is, the sound processing device 1 illustrated in FIG. 4 further includes the sound source position acquisition unit 16 and the sound source position discretization unit 17 in addition to the configuration of the sound processing device 1 illustrated in FIG. 1.

音源位置取得部１６は、シミュレーションの条件の他の要素である音源位置を示す音源位置信号を取得する。本実施形態では、収音位置の他、音源位置も時間の経過に応じて変動しうる。音源位置取得部１６は、例えば、所定の音源位置の移動パターンを示す音源位置信号を生成する。音源位置取得部１６は、操作部（図示せず）から入力される操作信号で逐次に指示される音源位置を示す音源位置信号を生成してもよい。音源位置取得部１６は、生成した音源位置信号を音源位置離散化部１７に出力する。 The sound source position acquisition unit 16 acquires a sound source position signal indicating a sound source position which is another element of the simulation condition. In this embodiment, in addition to the sound pickup position, the sound source position can also change over time. The sound source position acquisition unit 16 generates a sound source position signal indicating a movement pattern of a predetermined sound source position, for example. The sound source position acquisition unit 16 may generate a sound source position signal indicating a sound source position sequentially designated by an operation signal input from an operation unit (not shown). The sound source position acquisition unit 16 outputs the generated sound source position signal to the sound source position discretization unit 17.

音源位置離散化部１７は、音源位置取得部１６から入力される音源位置信号が示す音源位置を所定のサンプリング周波数に対応する時間間隔でサンプリングすることにより離散化する。このサンプリング周波数は、音響信号のサンプリング周波数と等しい周波数である。入力される音源位置信号は、音響信号のサンプリング時刻とは異なる離散化時刻ごとの音源位置を示すディジタル信号でありうる。その場合には、音源位置離散化部１７は、音源位置信号が示す時刻ごとの収音位置を補間して、そのサンプリング周波数に対応する時間間隔で離散化された時刻ごとに収音位置を算出する。音源位置離散化部１７は、離散化された音源位置信号をシミュレーション部１４に出力する。 The sound source position discretization unit 17 discretizes the sound source position indicated by the sound source position signal input from the sound source position acquisition unit 16 by sampling it at time intervals corresponding to a predetermined sampling frequency. This sampling frequency is equal to the sampling frequency of the acoustic signal. The input sound source position signal may be a digital signal indicating the sound source position at discrete times that differ from the sampling times of the acoustic signal. In that case, the sound source position discretization unit 17 interpolates the sound source positions indicated by the sound source position signal for those times, and calculates the sound source position for each time discretized at the time interval corresponding to the sampling frequency. The sound source position discretization unit 17 outputs the discretized sound source position signal to the simulation unit 14.
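The interpolation step performed by the sound source position discretization unit 17 can be sketched as a resampling of a position trajectory onto the sampling times of the acoustic signal. The text only says that positions are interpolated; linear interpolation, the function name, and the choice to hold the end positions outside the given time range are assumptions of this example.

```python
from bisect import bisect_right


def resample_positions(times, positions, fs, n_samples):
    """Linearly interpolate positions at t = k / fs for k = 0 .. n_samples - 1.

    times must be increasing and contain at least two entries;
    positions[i] is the coordinate tuple observed at times[i].
    """
    out = []
    for k in range(n_samples):
        t = k / fs
        i = bisect_right(times, t) - 1            # segment containing t
        i = max(0, min(i, len(times) - 2))
        t0, t1 = times[i], times[i + 1]
        a = (t - t0) / (t1 - t0)
        a = max(0.0, min(1.0, a))                 # hold end positions outside the range
        p0, p1 = positions[i], positions[i + 1]
        out.append(tuple((1.0 - a) * u + a * w for u, w in zip(p0, p1)))
    return out
```

The same routine applies unchanged to the sound collection position handled by the sound collection position discretization unit 13.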

シミュレーション部１４には、音源信号取得部１１から音源信号が入力され、収音位置離散化部１３から収音位置信号が入力される他、音源位置離散化部１７から音源位置信号が入力される。
インパルス応答取得部１４２は、モデルデータを用いて、離散化された時刻ごとに音源位置から収音位置信号が示す収音位置までのインパルス応答を生成する。本実施形態では、時刻ごとの音源位置、収音位置は、入力された音源位置信号、収音位置信号でそれぞれ指示される。従って、生成されるインパルス応答ｈ”_{ｑ（ｔ）ｐ（ｔ）}は、収音位置ｑ（ｔ）と音源位置ｐ（ｔ）に依存する。
シミュレーション部１４は、後述するシミュレーション方法に従い、時刻ｔごとに生成したインパルス応答を構成する応答係数を用いて、音源信号に対して畳み込み演算を行う。シミュレーション部１４は、畳み込み演算により得られた各時刻ｔで収音位置ｑ（ｔ）において収音されうる音の信号値ｘ”（ｔ）を合成信号生成部１５に出力する。 The simulation unit 14 receives the sound source signal from the sound source signal acquisition unit 11, the sound collection position signal from the sound collection position discretization unit 13, and the sound source position signal from the sound source position discretization unit 17.
The impulse response acquisition unit 142 generates an impulse response from the sound source position to the sound pickup position indicated by the sound pickup position signal at each discretized time using the model data. In the present embodiment, the sound source position and the sound pickup position for each time are indicated by the input sound source position signal and the sound pickup position signal, respectively. Therefore, the generated impulse response h ″ _{q (t) p (t)} depends on the sound pickup position q (t) and the sound source position p (t).
The simulation unit 14 performs a convolution operation on the sound source signal using a response coefficient included in the impulse response generated at each time t according to a simulation method described later. The simulation unit 14 outputs the signal value x ″ (t) of the sound that can be collected at the sound collection position q (t) at each time t obtained by the convolution operation to the composite signal generation unit 15.

（シミュレーション方法）
次に、本実施形態に係るシミュレーション方法について説明する。
図５は、時刻ｔにおけるインパルス応答ｈ”_{ｐ（ｔ）ｑ（ｔ）}を示す。インパルス応答ｈ”_{ｐ（ｔ）ｑ（ｔ）}は、音源位置ｐ（ｔ）から収音位置ｑ（ｔ）までの音の伝達特性を示す。時刻ｔにおけるインパルス応答ｈ”_{ｐ（ｔ）ｑ（ｔ）}は、第０次の応答係数ｈ”_{ｐ（ｔ）ｑ（ｔ）}（０）から第Ｎ−１次の応答係数ｈ”_{ｐ（ｔ）ｑ（ｔ）}（Ｎ−１）をそれぞれ要素として有するＮ次元のベクトル[ｈ”_{ｐ（ｔ）ｑ（ｔ）}（０），ｈ”_{ｐ（ｔ）ｑ（ｔ）}（１），ｈ”_{ｐ（ｔ）ｑ（ｔ）}（２），…，ｈ”_{ｐ（ｔ）ｑ（ｔ）}（Ｎ−１）]^Ｔとして表される。 (Simulation method)
Next, a simulation method according to the present embodiment will be described.
FIG. 5 shows the impulse response h″_{p(t)q(t)} at time t. The impulse response h″_{p(t)q(t)} indicates the transfer characteristic of sound from the sound source position p(t) to the sound pickup position q(t). The impulse response h″_{p(t)q(t)} at time t is expressed as an N-dimensional vector [h″_{p(t)q(t)}(0), h″_{p(t)q(t)}(1), h″_{p(t)q(t)}(2), ..., h″_{p(t)q(t)}(N−1)]^T whose elements are the 0th-order response coefficient h″_{p(t)q(t)}(0) through the (N−1)-th-order response coefficient h″_{p(t)q(t)}(N−1).

本実施形態では、音源位置ｐ（ｔ）、収音位置ｑ（ｔ）の両者が時間経過により変化しうるため、インパルス応答ｈ”_{ｐ（ｔ）ｑ（ｔ）}も時間経過に伴って変化しうる。収音位置ｑ（ｔ）の変動に対しては、シミュレーション部１４は、畳み込み演算において、過去の時刻ｔ−τにおける音源信号の信号値ｓ（ｔ−τ）の現時刻ｔにおける信号値ｘ”_ｑ（ｔ）（ｔ）に対する寄与率として、時刻ｔ−τにおける収音位置ｑ（ｔ−τ）に対する第τ次の応答係数ｈ”_{ｐ（ｔ）ｑ（ｔ−τ）}（τ）を用いればよい。ここで、音源位置ｐ（ｔ）の変動に関しては、各時刻ｔにおける音源位置ｐ（ｔ）に配置された音源Ｓｒが信号値ｓ（ｔ）に基づく音を放射し、その他の時刻ｔ−τにおける音源位置ｐ（ｔ−τ）に配置された音源が音を放射していないと仮定する。 In this embodiment, since both the sound source position p(t) and the sound pickup position q(t) can change over time, the impulse response h″_{p(t)q(t)} can also change over time. For the variation of the sound pickup position q(t), the simulation unit 14 may use, in the convolution operation, the τ-th order response coefficient h″_{p(t)q(t−τ)}(τ) for the sound pickup position q(t−τ) at time t−τ as the contribution of the signal value s(t−τ) of the sound source signal at the past time t−τ to the signal value x″_{q(t)}(t) at the current time t. For the variation of the sound source position p(t), it is assumed that the sound source Sr placed at the sound source position p(t) at each time t radiates sound based on the signal value s(t), and that a sound source placed at the sound source position p(t−τ) at any other time t−τ does not radiate sound.

そこで、シミュレーション部１４は、現時刻ｔにおける音源位置ｐ（ｔ）と収音位置ｑ（ｔ）に対応するインパルス応答の第０応答係数ｈ”_{ｐ（ｔ）ｑ（ｔ）}（０）から、現時刻ｔから（Ｎ−１）時刻前の時刻ｔ−Ｎ＋１における音源位置ｐ（ｔ−Ｎ＋１）と現時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第Ｎ−１応答係数ｈ”_{ｐ（ｔ−Ｎ＋１）ｑ（ｔ））}（Ｎ−１）までのＮ個の応答係数のそれぞれを、現時刻ｔにおける音源信号の信号値ｓ（ｔ）から、現時刻ｔから第Ｎ−１時刻前における信号値ｓ（ｔ−Ｎ＋１）までのそれぞれに乗算して得られる乗算値の総和を、現時刻ｔにおける収音位置ｑ（ｔ）で収音されうる音の信号値ｘ”_ｑ（ｔ）（ｔ）として算出する。 Therefore, the simulation unit 14 takes the N response coefficients from the 0th response coefficient h″_{p(t)q(t)}(0) of the impulse response corresponding to the sound source position p(t) and the sound pickup position q(t) at the current time t, through the (N−1)-th response coefficient h″_{p(t−N+1)q(t)}(N−1) of the impulse response corresponding to the sound source position p(t−N+1) at time t−N+1, which is N−1 time steps before the current time t, and the sound pickup position q(t) at the current time t. It multiplies these coefficients by the corresponding signal values of the sound source signal, from the signal value s(t) at the current time t through the signal value s(t−N+1) at N−1 time steps earlier, and calculates the sum of the products as the signal value x″_{q(t)}(t) of the sound that can be collected at the sound pickup position q(t) at the current time t.
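The sum of products described above can be written directly as a loop over the delay τ. The sketch below is illustrative only: `h` is a hypothetical callable that returns the N response coefficients of the impulse response from the sound source position at time `t_src` to the sound pickup position at time `t_rcv`, and source samples outside 0 to T−1 are treated as zero.

```python
def synthesize_moving(h, s, n_taps):
    """x''(t) = sum over tau of h(t - tau, t)[tau] * s(t - tau).

    h(t_src, t_rcv) -> sequence of n_taps coefficients (assumed interface).
    Returns the T + n_taps - 1 synthesized signal values.
    """
    T = len(s)
    x = []
    for t in range(T + n_taps - 1):
        acc = 0.0
        for tau in range(n_taps):
            t_src = t - tau              # emission time of the contributing sample
            if 0 <= t_src < T:
                acc += h(t_src, t)[tau] * s[t_src]
        x.append(acc)
    return x
```

Calling `h` inside the inner loop keeps the example short; a practical version would cache or precompute the impulse responses, since the same pair of positions is never revisited but generating each response may be costly.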

本実施形態では、シミュレーション部１４は、式（３）に示すように、音源信号ベクトルｓに、インパルス応答行列Ｈ”_ｑ（ｔ）を乗じて合成信号ベクトルｘ”_ｑ（ｔ）を算出することができる。 In the present embodiment, the simulation unit 14 can calculate the synthesized signal vector x″_{q(t)} by multiplying the sound source signal vector s by the impulse response matrix H″_{q(t)}, as shown in Expression (3).

合成信号ベクトルｘ”_ｑ（ｔ）は、［ｘ”_ｑ（０）（０），ｘ”_ｑ（１）（１），ｘ”_ｑ（２）（２），…，ｘ”_ｑ（ｔ）（ｔ），…，ｘ”_{ｑ（Ｔ＋Ｎ−２）}（Ｔ＋Ｎ−２）］^Ｔと表される。
インパルス応答行列Ｈ”_ｑ（ｔ）は、［ｈ”_０，ｈ”_１，ｈ”_２，…，ｈ”_ｔ，…，ｈ”_{Ｔ＋Ｎ−２}］^Ｔと表される。
要素ベクトルｈ”_ｔは、それぞれ次式で表されるＴ次元の行ベクトルである。 The synthesized signal vector x″_{q(t)} is expressed as [x″_{q(0)}(0), x″_{q(1)}(1), x″_{q(2)}(2), ..., x″_{q(t)}(t), ..., x″_{q(T+N−2)}(T+N−2)]^T.
The impulse response matrix H″_{q(t)} is expressed as [h″_0, h″_1, h″_2, ..., h″_t, ..., h″_{T+N−2}]^T.
Each of the element vectors h″_t is a T-dimensional row vector represented by the following equation.

従って、インパルス応答取得部１４２は、ステップＳ１０４(図３)において、モデルデータを用いて、各時刻ｔ_１（ｔ_１は、０からＴ−１までの整数）における音源位置ｐ（ｔ_１）と各時刻ｔ_２（ｔ_２は、０からＴ＋Ｎ−２までの整数）における収音位置ｑ（ｔ_２）との組にそれぞれ対応するインパルス応答ｈ”_{ｐ（ｔ１）ｑ（ｔ２）}を生成すればよい。 Accordingly, in step S104 (FIG. 3), the impulse response acquisition unit 142 may use the model data to generate an impulse response h″_{p(t_1)q(t_2)} for each pair of the sound source position p(t_1) at each time t_1 (t_1 is an integer from 0 to T−1) and the sound pickup position q(t_2) at each time t_2 (t_2 is an integer from 0 to T+N−2).

シミュレーション部１４は、ステップＳ１０６（図３において）、生成したインパルス応答ｈ”_{ｐ（ｔ１）ｑ（ｔ２）}からインパルス応答行列Ｈ”を構成する。インパルス応答行列Ｈ”を構成する際、第０行から第Ｎ−２行までの第ｔ行において、シミュレーション部１４は、時刻０における音源位置ｐ（０）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第ｔ応答係数ｈ”_{ｐ（０）ｑ（ｔ）}（ｔ）から時刻ｔにおける音源位置ｐ（ｔ）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第０応答係数ｈ”_{ｐ（ｔ）ｑ（ｔ）}（０）までのｔ＋１個の応答係数と、Ｔ−（ｔ＋１）個の０を各列の要素値として、その順序で配列する。
第Ｎ−１行から第Ｔ−１行までの第ｔ行において、シミュレーション部１４は、ｔ−Ｎ＋１個の０と、時刻ｔ−Ｎ＋１における音源位置ｐ（ｔ−Ｎ＋１）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第Ｎ−１応答係数ｈ” _{ｐ（ｔ−Ｎ＋１）ｑ（ｔ）}（Ｎ−１）から時刻ｔにおける音源位置ｐ（ｔ）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第０応答係数ｈ” _{ｐ（ｔ）ｑ（ｔ）}（０）までのＮ個の応答係数と、Ｔ−（ｔ＋１）個の０を各列の要素値として、その順序で配列する。
第Ｔ行から第Ｔ＋Ｎ−２行までの第ｔ行において、シミュレーション部１４は、Ｔ−Ｎ＋１個の０と、時刻ｔ−Ｎ＋１における音源位置ｐ（ｔ−Ｎ＋１）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第Ｎ−１応答係数ｈ” _{ｐ（ｔ−Ｎ＋１）ｑ（ｔ）}（Ｎ−１）から時刻Ｔ−１における音源位置ｐ（Ｔ−１）と時刻ｔにおける収音位置ｑ（ｔ）に対応するインパルス応答の第ｔ−Ｔ＋１応答係数ｈ”_{ｐ（Ｔ−１）ｑ（ｔ）}（ｔ−Ｔ＋１））までのＴ＋Ｎ−（ｔ＋１）個の応答係数を、各列の要素値として、その順序で配列する。 The simulation unit 14 constructs an impulse response matrix H ″ from the generated impulse response h ″ _{p (t1) q (t2)} in step S106 (in FIG. 3). When constructing the impulse response matrix H ″, in the t th row from the 0 th row to the N−2 th row, the simulation unit 14 sets the sound source position p (0) at time 0 and the sound pickup position q (t )), The impulse response corresponding to the sound source position p (t) at time t and the sound pickup position q (t) at time t from the t-th response coefficient h ″ _{p (0) q (t)} (t) of the impulse response corresponding to The t + 1 response coefficients up to the 0th response coefficient h ″ _{p (t) q (t)} (0) and the T− (t + 1) 0s are arrayed in that order as element values of each column.
In the t-th row, for rows N−1 through T−1, the simulation unit 14 arranges, in this order as the element values of the columns: t−N+1 zeros; the N response coefficients from the (N−1)-th response coefficient h″_{p(t−N+1)q(t)}(N−1) of the impulse response corresponding to the sound source position p(t−N+1) at time t−N+1 and the sound pickup position q(t) at time t, through the 0th response coefficient h″_{p(t)q(t)}(0) of the impulse response corresponding to the sound source position p(t) and the sound pickup position q(t) at time t; and T−(t+1) zeros.
In the t-th row, for rows T through T+N−2, the simulation unit 14 arranges, in this order as the element values of the columns: t−N+1 zeros (the leading-zero count that, together with the T+N−(t+1) coefficients that follow, fills the T columns of the row); and the T+N−(t+1) response coefficients from the (N−1)-th response coefficient h″_{p(t−N+1)q(t)}(N−1) of the impulse response corresponding to the sound source position p(t−N+1) at time t−N+1 and the sound pickup position q(t) at time t, through the (t−T+1)-th response coefficient h″_{p(T−1)q(t)}(t−T+1) of the impulse response corresponding to the sound source position p(T−1) at time T−1 and the sound pickup position q(t) at time t.
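The three row layouts described for step S106 can be summarized by a single entry-wise expression. The following is a reconstruction from the prose, not a reproduction of the equations in the original publication; the row index t and column index j are notation introduced here, and the layout is read so that every row has exactly T entries.

```latex
\left[H''\right]_{t,j} =
\begin{cases}
h''_{p(j)\,q(t)}(t-j), & 0 \le t-j \le N-1,\\[2pt]
0, & \text{otherwise,}
\end{cases}
\qquad 0 \le t \le T+N-2,\quad 0 \le j \le T-1,

x''_{q(t)}(t) \;=\; \sum_{j=0}^{T-1} \left[H''\right]_{t,j}\, s(j)
\;=\; \sum_{\tau=0}^{N-1} h''_{p(t-\tau)\,q(t)}(\tau)\, s(t-\tau),
\qquad s(k) = 0 \ \text{for}\ k \notin \{0,\dots,T-1\}.
```

Under this reading, column j always pairs the source sample s(j) with the sound source position p(j) at its emission time, while row t fixes the sound pickup position q(t) at the reception time.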

（評価実験）
上記の実施形態の音響処理方法の有効性を検証するために出願人は２項目の評価実験を行った。実験１では、合成信号のドップラー効果の再現性について検証した。実験１では、音源となるスピーカと収音部となるマイクロホンの位置関係として、次の移動パターン（ａ）〜（ｃ）を設定した。
パターン（ａ）当初、音源位置から収音位置までの距離を１８．７４ｍに設定しておき、収音位置を静止させたまま、音源位置を収音位置に秒速４０ｍ／ｓの速度で接近させた。
パターン（ｂ）当初、音源位置から収音位置までの距離を８．５ｍに設定しておき、音源位置を静止させたまま、収音位置を音源位置に秒速４０ｍ／ｓの速度で接近させた。
パターン（ｃ）当初、音源位置から収音位置までの距離を２６．７４ｍに設定しておき、音源位置と収音位置が互いに接近する方向に、それぞれ秒速４０ｍ／ｓの速度で接近させた。従って、合成信号の生成において、パターン（ａ）、（ｃ）については、第２の実施形態を適用し、パターン（ｂ）については、第１の実施形態を適用した。
合成信号の生成には、期間が０．２ｓの音源信号と長さが０．２５６ｓのインパルス応答を用いた。インパルス応答の生成において、Ｈａｂｅｔｓが提案した手法を用いた。但し、音速３４０ｍ／ｓ、サンプリング周波数８ｋＨｚ、残響時間０．２ｓおよび反射次数０次を仮定した。また、マイクロホンの指向特性として無指向性を仮定した。
検証結果の有効性を評価するために、合成信号の周波数と、収音信号の周波数の理論値とを比較した。ドップラー効果によれば、収音信号の周波数の理論値ｆ’は、式（５）に示すように、音源信号の周波数ｆに対して、音速Ｖと音源位置の移動速度ｖ_ｓとの差に対する音速Ｖと収音位置の移動速度ｖ_ｏとの和の比を乗じて得られる周波数となる。 (Evaluation experiment)
To verify the effectiveness of the sound processing method according to the above embodiments, the applicant performed two evaluation experiments. Experiment 1 verified how well the synthesized signal reproduces the Doppler effect. In Experiment 1, the following movement patterns (a) to (c) were set as the positional relationship between a loudspeaker serving as the sound source and a microphone serving as the sound pickup unit.
Pattern (a): Initially, the distance from the sound source position to the sound pickup position was set to 18.74 m, and the sound source position was moved toward the sound pickup position at 40 m/s while the sound pickup position was kept stationary.
Pattern (b): Initially, the distance from the sound source position to the sound pickup position was set to 8.5 m, and the sound pickup position was moved toward the sound source position at 40 m/s while the sound source position was kept stationary.
Pattern (c): Initially, the distance from the sound source position to the sound pickup position was set to 26.74 m, and the sound source position and the sound pickup position were moved toward each other, each at 40 m/s. Accordingly, in generating the synthesized signal, the second embodiment was applied to patterns (a) and (c), and the first embodiment was applied to pattern (b).
The synthesized signal was generated using a sound source signal with a duration of 0.2 s and an impulse response with a length of 0.256 s. The impulse response was generated by the method proposed by Habets, assuming a speed of sound of 340 m/s, a sampling frequency of 8 kHz, a reverberation time of 0.2 s, and a reflection order of 0. The microphone was assumed to be omnidirectional.
To evaluate the validity of the result, the frequency of the synthesized signal was compared with the theoretical value of the frequency of the collected signal. According to the Doppler effect, as shown in Equation (5), the theoretical value f′ of the frequency of the collected signal is obtained by multiplying the frequency f of the sound source signal by the ratio of the sum of the speed of sound V and the moving speed v_o of the sound pickup position to the difference between the speed of sound V and the moving speed v_s of the sound source position.
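The relationship stated for Equation (5), f′ = f (V + v_o) / (V − v_s), can be checked numerically. In the sketch below the function name is hypothetical, both speeds are taken as positive when the source and the pickup position approach each other, and the source frequency of 1000 Hz is an assumption: it is not stated in this passage, but it is the value for which the formula yields the theoretical frequencies quoted for patterns (a) to (c).

```python
def doppler_frequency(f, v_sound, v_source, v_observer):
    """Theoretical observed frequency for a source and observer approaching each other."""
    return f * (v_sound + v_observer) / (v_sound - v_source)


# Patterns (a), (b), (c) with V = 340 m/s and an assumed source frequency of 1000 Hz.
pattern_a = doppler_frequency(1000.0, 340.0, 40.0, 0.0)    # only the source moves
pattern_b = doppler_frequency(1000.0, 340.0, 0.0, 40.0)    # only the pickup position moves
pattern_c = doppler_frequency(1000.0, 340.0, 40.0, 40.0)   # both move toward each other
```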

パターン（ａ）では、理論値は、１１３３．３３Ｈｚであるのに対し、合成信号の周波数は、１１３３．４２Ｈｚとなった。パターン（ｂ）では、理論値は、１１１７．６５Ｈｚであるのに対し、合成信号の周波数は、１１１７．７１Ｈｚとなった。パターン（ｃ）では、理論値は、１２６６．６７Ｈｚであるのに対し、合成信号の周波数は、１２６６．８４Ｈｚとなった。パターン（ａ）〜（ｃ）ともに、合成信号の周波数の理論値との差分は、０．１４Ｈｚ以下に過ぎない。従って、実験１の結果は、音源位置や収音位置の移動に伴う周波数の変化が十分に再現できることを示す。 In the pattern (a), the theoretical value was 1133.33 Hz, whereas the frequency of the synthesized signal was 1133.42 Hz. In the pattern (b), the theoretical value was 1117.65 Hz, whereas the frequency of the combined signal was 1117.71 Hz. In the pattern (c), the theoretical value was 1266.67 Hz, whereas the frequency of the synthesized signal was 1266.64 Hz. In all of the patterns (a) to (c), the difference between the frequency of the synthesized signal and the theoretical value is only 0.14 Hz or less. Therefore, the result of Experiment 1 shows that the change in frequency due to the movement of the sound source position or the sound pickup position can be sufficiently reproduced.

実験２では、合成信号の音量について検証した。検証において、現実に音源から発した音を収音して得られる収音信号の音量と合成信号の音量とを比較した。音源信号として英文誌ウォールストリートジャーナル（ＷＳＪ：ＷａｌｌＳｔｒｅｅｔＪｏｕｒｎａｌ）の原稿のうち１０個の文を発話内容とする音声を用いた。
収音信号は、無響室内でスピーカとマイクロホンの一方もしくは両方を移動させながら収録した。無響室の内部は、縦６．２ｍ、横４．８ｍ、高さ５．１ｍの直方体の空間である。スピーカは、無響室の中心部を中心位置とする縦方向に４．0ｍの範囲を経路として移動させた。但し、スピーカを静止させる場合には、その経路の中心位置に静止させた。マイクロホンは、無響室の中心部から横方向に１．０ｍ離れた位置を中心位置とする縦方向に４．０ｍの範囲を経路として移動させた。但し、マイクロホンを静止させる場合には、その経路の中心位置に静止させた。スピーカとマイクロホンの位置関係として、次の移動パターン（ｉ）〜（ｖ）を設定した。 In Experiment 2, the volume of the synthesized signal was verified. In the verification, the volume of the collected signal obtained by actually collecting the sound emitted from the sound source was compared with the volume of the synthesized signal. As a sound source signal, a voice having utterance contents of ten sentences in a manuscript of a Wall Street Journal (WSJ) was used.
The picked-up signal was recorded while moving one or both of the speaker and the microphone in the anechoic room. The interior of the anechoic room is a rectangular parallelepiped space having a length of 6.2 m, a width of 4.8 m, and a height of 5.1 m. The loudspeaker was moved along a range of 4.0 m in the vertical direction with the center of the anechoic room as the center position. However, when the speaker was stopped, the speaker was stopped at the center position of the path. The microphone was moved along a range of 4.0 m in the vertical direction with the center position being 1.0 m away from the center of the anechoic chamber in the horizontal direction. However, when the microphone was stopped, the microphone was stopped at the center position of the path. The following movement patterns (i) to (v) were set as the positional relationship between the speaker and the microphone.

パターン（ｉ）音源位置、収音位置をいずれも静止させた。
パターン（ｉｉ）音源位置を静止させながら、収音位置を経路の一端から他端まで一定速度１．８ｍ／ｓで移動させた。
パターン（ｉｉｉ）収音位置を静止させながら、音源位置を経路の一端から他端まで一定速度１．８ｍ／ｓで移動させた。
パターン（ｉｖ）音源位置と収音位置を、同じ方向でそれぞれの経路上を一端から他端まで一定速度で移動させた。但し、音源位置の移動速度を１．８ｍ／ｓとし、収音位置の移動速度を１．７ｍ／ｓとした。
パターン（ｖ）音源位置と収音位置を、同じ速度でそれぞれの経路上を一定速度１．８ｍ／ｓで移動させた。但し、音源位置と収音位置の移動方向は、互いに逆方向である。音源の移動開始位置はその経路の一端であるのに対し、収音位置の移動開始位置はその経路の他端である。従って、パターン（ｉ）に対する合成信号は、従来の手法と同様に音源位置と収音位置に対するインパルス応答を音源信号に対して畳み込み演算を行って得られる。パターン（ｉｉ）に対する合成信号は、第１の実施形態の手法を実行して得られる。パターン（ｉｉｉ）〜（ｖ）に対する合成信号は、第２の実施形態の手法を実行して得られる。 Pattern (i) Both the sound source position and the sound pickup position were stopped.
Pattern (ii) The sound pickup position was moved from one end of the path to the other end at a constant speed of 1.8 m / s while the sound source position was kept still.
Pattern (iii) The sound source position was moved from one end of the path to the other end at a constant speed of 1.8 m / s while the sound collection position was stopped.
Pattern (iv) The sound source position and the sound pickup position were moved at a constant speed from one end to the other end on each path in the same direction. However, the moving speed of the sound source position was 1.8 m / s, and the moving speed of the sound pickup position was 1.7 m / s.
Pattern (v) The sound source position and the sound pickup position were moved at the same speed on each route at a constant speed of 1.8 m / s. However, the moving directions of the sound source position and the sound pickup position are opposite to each other. The movement start position of the sound source is one end of the route, whereas the movement start position of the sound pickup position is the other end of the route. Therefore, the composite signal for the pattern (i) is obtained by performing a convolution operation on the sound source signal with the impulse response for the sound source position and the sound pickup position, as in the conventional method. A composite signal for the pattern (ii) is obtained by executing the method of the first embodiment. The composite signals for the patterns (iii) to (v) are obtained by executing the method of the second embodiment.

評価を行う前に、合成信号に対する増幅率Ａを定める。増幅率Ａは、合成信号全体に対する音量を収音信号全体に対する音量に合わせるためのパラメータである。
増幅率Ａは、式（６）に基づいて計算できる。 Before the evaluation, the amplification factor A for the combined signal is determined. The amplification factor A is a parameter for adjusting the volume of the entire synthesized signal to the volume of the entire collected signal.
The amplification factor A can be calculated based on equation (6).

ｘ_ｓ（ｆ，ｔ）は、第ｆフレームの時刻ｔにおける合成信号の信号値を示す。ｘ_ｒ（ｆ，ｔ）は、第ｆフレームの時刻ｔにおける収音信号の信号値を示す。Ｆ、Ｎは、それぞれフレーム数、フレーム内のサンプル数を示す。従って、増幅された合成信号全体の音量が収音信号の音量に全体として等しくする増幅率Ａ’が、増幅率Ａとして算出される。 x_s(f, t) denotes the signal value of the synthesized signal at time t of the f-th frame. x_r(f, t) denotes the signal value of the collected signal at time t of the f-th frame. F and N denote the number of frames and the number of samples per frame, respectively. Accordingly, the amplification factor A′ that makes the volume of the amplified synthesized signal as a whole equal to the volume of the collected signal as a whole is calculated as the amplification factor A.

つまり、式（６）に示す増幅率Ａは、式（７）に示す関数Ｃ（Ａ）を最小にするとの条件のもとで与えられる。 That is, the amplification factor A shown in Expression (6) is given under the condition that the function C (A) shown in Expression (7) is minimized.

関数Ｃ（Ａ）の増幅率Ａに対する導関数は、式（８）で与えられる。 The derivative of the function C (A) with respect to the amplification factor A is given by equation (8).

式（８）の両辺を０とおくと、式（９）の関係が得られる。 If both sides of equation (8) are set to 0, the relationship of equation (9) is obtained.

式（９）を変形すると、式（１０）が得られる。式（１０）を用いて増幅率Ａが算出される。 By transforming equation (9), equation (10) is obtained. The amplification factor A is calculated using the equation (10).
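Equations (6) to (10) are not reproduced in this text, so the following is only a sketch under an explicit assumption: that the function C(A) is the sum, over all frames and samples, of the squared difference between A·x_s(f, t) and x_r(f, t). Under that assumption, setting the derivative with respect to A to zero gives A = Σ x_s x_r / Σ x_s². The function name and the flattened input lists are hypothetical.

```python
def amplification_factor(x_s, x_r):
    """Least-squares gain A minimizing sum((A * x_s - x_r) ** 2).

    x_s and x_r are flat sequences of synthesized and collected signal
    values over all frames (assumed layout).
    """
    if len(x_s) != len(x_r):
        raise ValueError("signals must have the same length")
    den = sum(a * a for a in x_s)        # energy of the synthesized signal
    if den == 0.0:
        raise ValueError("synthesized signal is all zeros")
    num = sum(a * b for a, b in zip(x_s, x_r))
    return num / den
```

If the cost in the original equations is defined on frame-level volumes rather than on sample values, the closed form would differ accordingly.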

そして、合成信号ｘ_ｓ（ｆ，ｔ）に増幅率Ａを乗算して、補正合成信号ｘ’_ｓ（ｆ，ｔ）を算出する。次に、合成信号と収音信号の音量の類似性の尺度として、式（１１）を用いて距離Ｄ_ｓを算出する。 Then, the synthesized signal x_s(f, t) is multiplied by the amplification factor A to calculate the corrected synthesized signal x′_s(f, t). Next, as a measure of the similarity in volume between the synthesized signal and the collected signal, the distance D_s is calculated using Equation (11).

距離Ｄ_ｓは、合成信号と収音信号の信号値のフレームごとの差の大きさを示す。評価において、距離Ｄ_ｓを音源位置と収音位置の時間変化に伴う両信号間の振幅変化の差の大きさを示す尺度として用いた。
なお、比較のために、式（１２）を用いて原信号ｘ’_ｏ（ｆ，ｔ）と収音信号との距離Ｄ_ｏを算出した。 The distance D_s indicates the magnitude of the frame-by-frame difference between the signal values of the synthesized signal and the collected signal. In the evaluation, the distance D_s was used as a measure of the difference in amplitude change between the two signals caused by the temporal change of the sound source position and the sound pickup position.
For comparison, the distance D_o between the original signal x′_o(f, t) and the collected signal was calculated using Equation (12).

評価において、パターン（ｉ）〜（ｖ）のそれぞれについて、距離Ｄ_ｓと距離Ｄ_ｏを算出した。次に、距離Ｄ_ｓと距離Ｄ_ｏの算出例を示す。但し、次に示す算出例は、移動パターンごとの１０回の発話間の平均値である。
距離Ｄ_sは、パターン（ｉ）、（ｉｉ）、（ｉｉｉ）、（ｉｖ）、（ｖ）のそれぞれについて、０．０１１０、０．０１４７、０．０９６、０．０１２０、０．００８９となった。
距離Ｄ_oは、パターン（ｉ）、（ｉｉ）、（ｉｉｉ）、（ｉｖ）、（ｖ）のそれぞれについて、０．０１０８、０．０３０２、０．０３３５、０．０１３９、０．０３７２となった。
算出した距離Ｄ_ｓは、パターン（ｖ）、（ｉｉｉ）、（ｉ）、（ｉｖ）、（ｉｉ）の順に大きくなるが、いずれのパターンにかかわらず、約０．０１となり、音源位置と収音位置の相対速度との相関性も認められない。最も相対速度が大きい移動パターン（ｖ）でも距離Ｄ_Ｓは０．００８９に過ぎない。
他方、距離Ｄ_oは、パターン（ｉ）、（ｉｖ）、（ｉｉ）、（ｉｉｉ）、（ｖ）の順に大きくなる傾向がある。このことは、相対速度が大きいほど移動に伴う音量の変化が著しいことを裏付ける。パターン（ｉ）では、音源位置と収音位置の相対速度が０となり、移動パターン（ｉｖ）では、音源位置と収音位置の相対速度が０．１ｍ／ｓとなり、移動パターン（ｉｉ）、（ｉｉｉ）では、音源位置と収音位置の相対速度が１．８ｍ／ｓとなり、移動パターン（ｖ）では、音源位置と収音位置の相対速度が３．６ｍ／ｓとなる。移動パターン（ｉ）、（ｉｖ）のように音源位置と収音位置の相対速度が０や０と近い場合に、距離Ｄ_oと距離Ｄ_ｓが近似するに過ぎない。
従って、実験２の結果は、音源位置と収音位置の間の相対速度が高くなっても、移動に伴う音量の変化を再現できることを示す。 In the evaluation, the distance D_s and the distance D_o were calculated for each of the patterns (i) to (v). Calculated examples of the distance D_s and the distance D_o are shown next. Each value is the average over 10 utterances for the corresponding movement pattern.
The distance D_s was 0.0110, 0.0147, 0.096, 0.0120, and 0.0089 for patterns (i), (ii), (iii), (iv), and (v), respectively.
The distance D_o was 0.0108, 0.0302, 0.0335, 0.0139, and 0.0372 for patterns (i), (ii), (iii), (iv), and (v), respectively.
The calculated distance D_s increases in the order of patterns (v), (iii), (i), (iv), and (ii), but is about 0.01 regardless of the pattern, and no correlation with the relative speed between the sound source position and the sound pickup position is observed. Even for movement pattern (v), which has the largest relative speed, the distance D_s is only 0.0089.
On the other hand, the distance D_o tends to increase in the order of patterns (i), (iv), (ii), (iii), and (v). This confirms that the greater the relative speed, the more pronounced the change in volume caused by the movement. The relative speed between the sound source position and the sound pickup position is 0 in pattern (i), 0.1 m/s in movement pattern (iv), 1.8 m/s in movement patterns (ii) and (iii), and 3.6 m/s in movement pattern (v). The distance D_o and the distance D_s are close to each other only when the relative speed between the sound source position and the sound pickup position is 0 or close to 0, as in movement patterns (i) and (iv).
Therefore, the result of Experiment 2 shows that even if the relative speed between the sound source position and the sound pickup position is increased, the change in the volume due to the movement can be reproduced.

The sound processing device 1 according to the embodiment described above includes a sound pickup position discretization unit 13 that discretizes a sound pickup position, which is the position of a moving sound pickup unit, at predetermined time intervals, and a simulation unit 14 that acquires an impulse response indicating the transfer characteristic from a sound source position to the sound pickup position. The impulse response includes N response coefficients, from the 0th response coefficient to the (N-1)th response coefficient, at each time. The simulation unit 14 performs a convolution operation on the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), and on the signal values obtained by discretizing the acoustic signal emitted by the sound source at the predetermined time intervals, using the signal values from the signal value at the current time t to the signal value at time t-(N-1). It thereby calculates a signal value indicating a synthesized signal, which is the acoustic signal at the sound pickup position.
With this configuration, it is possible to easily obtain a synthesized signal that is similar to a sound pickup signal picked up by the moving sound pickup unit.
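The per-sample computation described above can be sketched roughly as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the use of NumPy, and in particular `toy_impulse_response` (a single delayed tap scaled by 1/distance in free field) are assumptions introduced only for the example. The sketch also assumes that all N coefficients used for the output at time t are those obtained for the pickup position at time t, and that the source signal is zero before time 0.

```python
import numpy as np

def discretize_positions(trajectory, num_samples, interval_s):
    """Sample a continuous pickup trajectory at a fixed time interval."""
    times = np.arange(num_samples) * interval_s
    return np.array([trajectory(t) for t in times])

def toy_impulse_response(source_pos, pickup_pos, n_coeffs, fs, c=343.0):
    """Hypothetical stand-in for the simulation: one delayed tap, gain 1/distance."""
    dist = float(np.linalg.norm(np.asarray(pickup_pos) - np.asarray(source_pos)))
    h = np.zeros(n_coeffs)
    delay = int(round(dist / c * fs))  # propagation delay in samples
    if delay < n_coeffs:
        h[delay] = 1.0 / max(dist, 1e-3)
    return h

def synthesize(x, responses):
    """y[t] = sum_k responses[t, k] * x[t - k], with x treated as 0 before time 0."""
    T, N = responses.shape
    y = np.zeros(T)
    for t in range(T):
        for k in range(min(N, t + 1)):
            y[t] += responses[t, k] * x[t - k]
    return y

# Example: pickup unit moving away from a source at the origin at 2 m/s.
fs = 8000                      # assumed sampling rate in Hz
num_samples, n_coeffs = 256, 64
source = np.zeros(3)
positions = discretize_positions(
    lambda t: np.array([0.5 + 2.0 * t, 0.0, 0.0]), num_samples, 1.0 / fs)
responses = np.array(
    [toy_impulse_response(source, p, n_coeffs, fs) for p in positions])
signal = np.sin(2 * np.pi * 440.0 * np.arange(num_samples) / fs)
synthesized = synthesize(signal, responses)
```

When the pickup position does not move, every row of `responses` is identical and the loop reduces to an ordinary convolution, which gives a simple sanity check for the sketch.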

In addition, the sound processing device 1 may further include a sound source position discretization unit 17 that discretizes a moving sound source position at predetermined time intervals. The simulation unit 14 acquires an impulse response indicating the transfer characteristic from the discretized sound source position to the sound pickup position.
With this configuration, it is possible to easily obtain a synthesized signal that is similar to a sound pickup signal picked up in accordance with a sound emitted from a moving sound source.

Further, in the sound processing device 1, the simulation unit 14 generates a simulation matrix of T+N-1 rows and T columns that includes the response coefficients as element values. The t-th row, for rows 0 through N-2 of the simulation matrix, includes as the element values of its columns the response coefficients from the t-th response coefficient based on the sound pickup position at time t to the 0th response coefficient based on the sound pickup position at time t, followed by T-(t+1) zeros. The t-th row, for rows N-1 through T-1 of the simulation matrix, includes T-N+1 zeros, the response coefficients from the (N-1)th response coefficient based on the sound pickup position at time t to the 0th response coefficient based on the sound pickup position at time t, and T-(t+1) zeros as the element values of its columns. The t-th row, for rows T through T+N-2 of the simulation matrix, includes t-N+1 zeros and the response coefficients from the (N-1)th response coefficient based on the sound pickup position at time t to the (t-T+1)th response coefficient based on the sound pickup position at time t as the element values of its columns. The simulation unit 14 then generates an acoustic signal vector that includes the signal values from the signal value at time 0 to the signal value at time T-1 as the element values of its rows, and multiplies the acoustic signal vector by the generated simulation matrix.
According to this configuration, a sound pickup signal vector is obtained by multiplying the acoustic signal vector based on the sound source signal by the impulse response matrix including, as element values, the response coefficients of the impulse response corresponding to the sound source position and the moving sound source position. Therefore, the signal value of the picked-up signal can be easily obtained without requiring a complicated operation.
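As a rough illustration of the matrix formulation, the following sketch builds a (T+N-1)-by-T matrix whose row t holds the response coefficients for the pickup position at time t in reversed order, and multiplies it by the acoustic signal vector. The function name `build_simulation_matrix` and the array layout `responses[t, k]` are assumptions made for the example. The text gives T-N+1 leading zeros for rows N-1 through T-1; the sketch instead assumes t-N+1 leading zeros, so that every row has exactly T entries. It also assumes that response coefficients are available for all T+N-1 output times.

```python
import numpy as np

def build_simulation_matrix(responses, T):
    """Build the (T+N-1) x T matrix from per-time response coefficients.

    responses[t, k] is the k-th response coefficient for the pickup
    position at time t, for t = 0 .. T+N-2 (hypothetical layout).
    """
    rows, N = responses.shape
    if rows != T + N - 1:
        raise ValueError("responses must have T + N - 1 rows")
    A = np.zeros((T + N - 1, T))
    for t in range(T + N - 1):
        for k in range(N):
            col = t - k            # coefficient k multiplies the signal value at time t-k
            if 0 <= col < T:       # signal values exist only for times 0 .. T-1
                A[t, col] = responses[t, k]
    return A

# Example with random coefficients standing in for simulated impulse responses.
rng = np.random.default_rng(0)
T, N = 8, 3
responses = rng.standard_normal((T + N - 1, N))
x = rng.standard_normal(T)         # acoustic signal vector (times 0 .. T-1)
A = build_simulation_matrix(responses, T)
y = A @ x                          # synthesized signal values, length T+N-1
```

With identical coefficients in every row, the product `A @ x` equals the full linear convolution of the signal with that single impulse response, which is the static special case.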

Note that a part of the sound processing device 1 in the above-described embodiment, for example, the sound pickup position discretization unit 13, the simulation unit 14, the synthesized signal generation unit 15, and the sound source position discretization unit 17, may be realized by a computer. In that case, a program for realizing this control function may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read into a computer system and executed. Here, the "computer system" is a computer system that includes one or more processors, such as a CPU built into the sound processing device 1, and includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may further include a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted through a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. The above program may realize only a part of the functions described above, or may realize those functions in combination with a program already recorded in the computer system.
Further, a part or all of the sound processing device 1 in the above-described embodiment may be realized as an integrated circuit such as an LSI (Large Scale Integration). Each functional block of the sound processing device 1 may be individually implemented as a processor, or a part or all of the functional blocks may be integrated into a processor. The method of circuit integration is not limited to an LSI, and may be realized by a dedicated circuit or a general-purpose processor. Further, in the case where a technology for forming an integrated circuit that replaces the LSI appears due to the advance of the semiconductor technology, an integrated circuit based on the technology may be used.

An embodiment of the present invention has been described above in detail with reference to the drawings. However, the specific configuration is not limited to the one described above, and various design changes and the like can be made without departing from the gist of the present invention.
For example, the sound processing device 1 may be configured as a part of an acoustic model learning unit (not shown) used for speech recognition. The acoustic model learning unit generates an acoustic model using the synthesized signal generated by the synthesized signal generation unit 15 for each movement pattern, each of which indicates a time series of the sound source position. The acoustic model learning unit calculates an acoustic feature (for example, MFCC: Mel-Frequency Cepstrum Coefficients) from the generated synthesized signal for each frame of a predetermined length (for example, 10 to 50 ms), and uses the calculated acoustic features to update an existing, previously generated acoustic model by maximum likelihood linear regression (MLLR). As the existing acoustic model, for example, a GMM (Gaussian Mixture Model) or a hidden Markov model (HMM) trained using speech collected in a static environment, in which the sound source position and the sound pickup position are fixed, can be applied. In this way, an acoustic model for each movement pattern can be obtained from a relatively small amount of synthesized signal. By using the acoustic model generated for each movement pattern, a speech recognition device can improve recognition of uttered speech according to the movement pattern of the speaker or the sound pickup unit.
Further, the sound processing device 1 may be configured as an acoustic simulator that generates and auralizes a synthesized signal indicating the sound that propagates from a sound source position to a sound pickup position in a virtual acoustic environment.

DESCRIPTION OF SYMBOLS: 1 ... sound processing device, 11 ... sound source signal acquisition unit, 12 ... sound pickup position acquisition unit, 13 ... sound pickup position discretization unit, 14 ... simulation unit, 15 ... synthesized signal generation unit, 16 ... sound source position acquisition unit, 17 ... sound source position discretization unit

Claims

A sound processing device comprising:
a sound pickup position discretization unit that discretizes a sound pickup position, which is the position of a moving sound pickup unit, at predetermined time intervals; and
a simulation unit that acquires an impulse response indicating a transfer characteristic from a sound source position to the sound pickup position,
wherein the impulse response includes N (N is an integer greater than 1) response coefficients from a 0th response coefficient to an (N-1)th response coefficient at each time, and
the simulation unit performs a convolution operation on the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and on signal values obtained by discretizing the acoustic signal emitted by the sound source at the predetermined time intervals, using the signal values from the signal value at the current time t to the signal value at the time t-(N-1), thereby calculating a signal value at the current time t that indicates the acoustic signal at the sound pickup position.

The sound processing device according to claim 1, further comprising a sound source position discretization unit that discretizes the moving sound source position at predetermined time intervals,
wherein the simulation unit acquires an impulse response indicating a transfer characteristic from the discretized sound source position to the sound pickup position.

The sound processing device according to claim 1, wherein the simulation unit:
generates a simulation matrix of T+N-1 rows and T columns (T is an integer greater than N) including the response coefficients as element values, in which
the t-th row, for rows 0 through N-2 of the simulation matrix, includes as the element values of its columns the response coefficients from the t-th response coefficient based on the sound pickup position at time t to the 0th response coefficient based on the sound pickup position at time t, and T-(t+1) zeros,
the t-th row, for rows N-1 through T-1 of the simulation matrix, includes as the element values of its columns T-N+1 zeros, the response coefficients from the (N-1)th response coefficient based on the sound pickup position at time t to the 0th response coefficient based on the sound pickup position at time t, and T-(t+1) zeros, and
the t-th row, for rows T through T+N-2 of the simulation matrix, includes as the element values of its columns t-N+1 zeros and the response coefficients from the (N-1)th response coefficient based on the sound pickup position at time t to the (t-T+1)th response coefficient based on the sound pickup position at time t;
generates an acoustic signal vector including, as the element values of its rows, the signal values from the signal value at time 0 to the signal value at time T-1; and
multiplies the acoustic signal vector by the simulation matrix.

A sound processing method in a sound processing device, the method comprising:
a sound pickup position discretization process in which the sound processing device discretizes a sound pickup position, which is the position of a moving sound pickup unit, at predetermined time intervals; and
a simulation process in which the sound processing device acquires an impulse response indicating a transfer characteristic from a sound source position to the sound pickup position,
wherein the impulse response includes N (N is an integer greater than 1) response coefficients from a 0th response coefficient to an (N-1)th response coefficient at each time, and
in the simulation process, a convolution operation is performed on the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and on signal values obtained by discretizing the acoustic signal emitted by the sound source at the predetermined time intervals, using the signal values from the signal value at the current time t to the signal value at the time t-(N-1), thereby calculating a signal value at the current time t that indicates the acoustic signal at the sound pickup position.

A program for causing a computer of a sound processing device to execute:
a sound pickup position discretization procedure of discretizing a sound pickup position, which is the position of a moving sound pickup unit, at predetermined time intervals; and
a simulation procedure of acquiring an impulse response indicating a transfer characteristic from a sound source position to the sound pickup position,
wherein the impulse response includes N (N is an integer greater than 1) response coefficients from a 0th response coefficient to an (N-1)th response coefficient at each time, and
in the simulation procedure, a convolution operation is performed on the response coefficients from the 0th response coefficient at the current time t to the (N-1)th response coefficient at time t-(N-1), which is N-1 time steps before the current time t, and on signal values obtained by discretizing the acoustic signal emitted by the sound source at the predetermined time intervals, using the signal values from the signal value at the current time t to the signal value at the time t-(N-1), thereby calculating a signal value at the current time t that indicates the acoustic signal at the sound pickup position.