JP6821390B2

JP6821390B2 - Sound processing equipment, sound processing methods and programs

Info

Publication number: JP6821390B2
Application number: JP2016208845A
Authority: JP
Inventors: 恭平北澤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2016-10-25
Filing date: 2016-10-25
Publication date: 2021-01-27
Anticipated expiration: 2036-10-25
Also published as: JP2018074252A

Description

The present invention relates to a sound processing apparatus, a sound processing method, and a program.

A technique is known for dividing a space into a plurality of areas and acquiring the sound of each area (Patent Document 1).

Japanese Unexamined Patent Publication No. 2014-72708

However, when the sound of the divided areas is processed in real time for broadcast, the processing or transmission may not finish in time, so that data is lost and the sound is interrupted.

The present invention has been made in view of the above problem, and its object is to provide a technique that enables more efficient processing in a configuration that acquires sound from a plurality of areas into which a space is divided and generates a reproduction signal.

To achieve the above object, a sound processing apparatus according to the present invention has the following configuration. That is, the apparatus comprises:
acquisition means for acquiring a sound collection signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining the positions and sizes of a plurality of partial areas in the sound collection region based on the position of an object detected in the sound collection region;
extraction means for extracting, from the sound collection signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means; and
generation means for generating a reproduction sound signal according to the position and orientation of a designated virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means,
wherein the determination means determines the sizes of the plurality of partial areas such that a partial area that includes the position of the object detected in the sound collection region is smaller than a partial area that does not include the position of the object detected in the sound collection region.

According to the present invention, processing can be made more efficient in a configuration that acquires sound from a plurality of areas into which a space is divided and generates a reproduction signal.

Block diagram showing the configuration of the audio signal processing device.
Explanatory diagram of separation area control.
Explanatory diagram showing changes in separation area control over time.
Block diagram showing the hardware configuration of the audio signal processing device.
Flowchart showing audio signal processing.
Diagram illustrating a display device for separation area control.
Diagram illustrating a sound system.
Block diagram showing the detailed configuration of the sound system.
Explanatory diagram of separation area control.
Flowchart showing audio signal processing.

Embodiments of the present invention are described below with reference to the drawings. The following embodiments do not limit the present invention, and not all combinations of the features described in the embodiments are essential to the solution provided by the present invention. Identical components are denoted by the same reference numerals.

<Embodiment 1>
The first embodiment (Embodiment 1) of the present invention describes a configuration that reduces the number of separation areas used when sound source separation processing can no longer keep up with real-time reproduction.

(Audio signal processing device)
FIG. 1 is a block diagram showing the configuration of the audio signal processing device 100. The audio signal processing device 100 collects sound from a predetermined spatial area with a microphone array, separates the collected sound into a plurality of audio signals based on a plurality of separation areas, applies audio processing to them, and mixes them to generate a reproduction signal. The audio signal processing device 100 includes a microphone array 111, a sound source separation unit 112, a separation area control unit 113, an audio signal processing unit 114, a storage unit 115, a real-time reproduction signal generation unit 116, and a replay reproduction signal generation unit 117.

The microphone array 111 consists of a plurality of microphones and collects the sound of the space it is responsible for. Because each microphone of the array collects sound individually, the sound collected by the microphone array 111 is, as a whole, a multi-channel signal made up of the sounds collected by the individual microphones. The microphone array 111 applies A/D (analog-to-digital) conversion to the collected signals and then outputs them to the sound source separation unit 112.

The sound source separation unit 112, the separation area control unit 113, the audio signal processing unit 114, the real-time reproduction signal generation unit 116, and the replay reproduction signal generation unit 117 are each implemented by an arithmetic processing device such as a CPU (central processing unit), DSP, or MPU. DSP stands for digital signal processor, and MPU stands for micro-processing unit.

When the space covered by the microphone array 111 is divided into N (N > 1) areas (hereinafter, "separation areas"), the sound source separation unit 112 performs sound source separation processing that separates the signal input from the microphone array 111 into the sound of each separation area. As described above, the signal input from the microphone array 111 is a multi-channel signal made up of the sounds collected by the individual microphones. The sound of any separation area can therefore be reproduced by applying phase control and weighting to the audio signals collected by the microphones, based on the positional relationship between each microphone of the microphone array 111 and the separation area to be captured, and then summing them. In this embodiment, an example in which the arrangement of the separation areas is determined in advance is described. The sound source separation unit 112 uses the signal input from the microphone array 111 to perform sound source separation such that the space is divided into N (N > 1) areas. The separation processing is performed for each processing frame, that is, at predetermined time intervals. For example, beamforming is performed at predetermined time intervals to acquire the sound of each area. The separated sound is output to the audio signal processing unit 114 and the storage unit 115.
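The "phase control, weighting, and summing" described above can be illustrated with a delay-and-sum beamformer. The following is a minimal sketch and not the method of the patent: the function name `delay_and_sum`, the integer-sample delays, the uniform default weights, and the speed of sound of 343 m/s are all assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def delay_and_sum(channels, mic_positions, area_center, sample_rate, weights=None):
    """Steer a microphone array toward area_center by delaying and summing.

    channels: one list of samples per microphone (all the same length).
    mic_positions, area_center: coordinates in metres.
    """
    n_mics = len(channels)
    if weights is None:
        weights = [1.0 / n_mics] * n_mics  # assumed uniform weighting
    dists = [math.dist(p, area_center) for p in mic_positions]
    # Sound from area_center reaches nearer microphones first, so each channel
    # is delayed until it lines up with the farthest microphone.
    farthest = max(dists)
    shifts = [round((farthest - d) / SPEED_OF_SOUND * sample_rate) for d in dists]
    length = len(channels[0])
    out = [0.0] * length
    for ch, shift, w in zip(channels, shifts, weights):
        for n in range(shift, length):
            out[n] += w * ch[n - shift]
    return out
```

Sound arriving from the chosen area adds coherently after alignment, while sound from other positions adds with mismatched delays and is attenuated. A practical implementation would typically use fractional delays or frequency-domain phase shifts rather than whole-sample shifts.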

The separation area control unit 113 controls how the given space from which the microphone array collects sound is divided into a plurality of separation areas, according to the processing load of sound source separation, reproduction signal generation, and the like. Specifically, it controls the arrangement and number of the separation areas. For example, when the processing load is high and separating all areas would not finish in time for real-time reproduction, the separation area control unit 113 merges separation areas used by the sound source separation unit 112 to reduce the number of areas. When processing comfortably keeps up, the sound collection space A1 is finely divided into 8 × 8 = 64 separation areas A2, as shown in FIG. 2(A), for example. When processing can no longer keep up, the unit determines, for example, whether the sound in each area was at or above a predetermined level in the processing of the previous frame, and merges the areas below that level to reduce the number of areas, as shown in FIG. 2(B). Sound at or above the predetermined level is likely to be significant, whereas sound below it is likely to be insignificant sound such as noise. Therefore, preferentially assigning fine separation areas to areas whose sound is at or above the predetermined level reproduces significant sound faithfully, while merging the separation areas below that level speeds up processing.
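One way to picture the merging of quiet areas is a quadtree over the 8 × 8 grid: a block stays merged unless one of its cells was at or above the threshold in the previous frame. This is a sketch under assumptions, since the text does not specify the merging rule beyond the figures; the function name `plan_areas`, the quadtree scheme, and the coarsest block side of 4 cells (four large areas) are illustrative choices.

```python
def plan_areas(levels, threshold, max_block=4):
    """Return (row, col, side) blocks that cover a square grid of cell levels.

    levels: square grid of previous-frame sound levels, one per finest cell.
    Blocks with no cell at or above threshold stay merged; others are split.
    """
    n = len(levels)

    def is_loud(r, c, side):
        return any(levels[i][j] >= threshold
                   for i in range(r, r + side)
                   for j in range(c, c + side))

    def split(r, c, side):
        if side == 1 or not is_loud(r, c, side):
            return [(r, c, side)]
        half = side // 2
        blocks = []
        for dr in (0, half):
            for dc in (0, half):
                blocks.extend(split(r + dr, c + dc, half))
        return blocks

    areas = []
    for r in range(0, n, max_block):
        for c in range(0, n, max_block):
            areas.extend(split(r, c, max_block))
    return areas
```

With an entirely quiet 8 × 8 grid this yields four blocks of side 4, and a single loud cell yields ten blocks in total, so the separation count drops from 64 while the loud cell keeps the finest resolution.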

FIG. 3 shows an example of how the separation size of the areas changes. FIG. 3(D) shows whether area control based on the processing load is being performed (area control ON) or not (area control OFF). fp through fp+7 are frame numbers. FIG. 3(C) shows whether the level of the sound separated for each area is at or above the predetermined level (sound present) or below it (no sound). Here, sound is present in frames fp+1 and fp+3. FIG. 3(B) shows the size of the most finely divided area, expressed as the area of the smallest area when the area of the sound collection space A1 is taken as 1. For example, in frame fp the space is divided equally into 64 areas, so the minimum area size is 1/64. FIG. 3(A) shows how the space is divided into areas in each frame.

Frames fp+1 through fp+6 are the period during which the processing load is high and the number of areas must be reduced. In frame fp, the sound level did not exceed the predetermined value in any area (no sound in FIG. 3(C)). Therefore, in frame fp+1 the space is divided into four large areas, each with a side 1/2 that of the sound collection space (1/4 in FIG. 3(B)).

In frame fp+1, there was an area in which the sound level exceeded the predetermined value (sound present in FIG. 3(C)). Therefore, in frame fp+2 the area A3 in which sound was present is again divided into small areas, each with a side 1/8 that of the sound collection space A1 (1/64 in FIG. 3(B)).

Next, in frame fp+2 the sound level did not exceed the predetermined value in any area (no sound in FIG. 3(C)). Therefore, in frame fp+3 some areas are merged, giving medium-sized areas with a side 1/4 that of the sound collection space (1/16 in FIG. 3(B)).

In frame fp+3, there was an area in which the sound level exceeded the predetermined value (sound present in FIG. 3(C)). Therefore, in frame fp+4 the area A3 in which sound was present is again divided into small areas, each with a side 1/8 that of the sound collection space (1/64 in FIG. 3(B)).

In frames fp+4 and fp+5, the sound level did not exceed the predetermined value in any area (no sound in FIG. 3(C)). Therefore, the areas are merged, and in frame fp+6 the space is divided into four large areas, each with a side 1/2 that of the sound collection space.

In this way, the separation area control unit 113 increases or decreases the number of separation areas according to whether sound is detected. The above describes an example in which the separation area control unit 113 merges sound source separation areas to reduce the number of areas. In practice, however, the sound source separation unit 112 may hold beamforming filters for a plurality of area sizes, and the separation area control unit 113 may control which filter is used.

Furthermore, for areas merged by separation area control, the separation area control unit 113 manages the frame and the merged-area information as a separation area control list. For example, when four areas are merged in frame fq, frame fq and the four areas are managed as a list entry. The areas are given IDs or the like in advance so that they can be distinguished. When the processing load has decreased, the separation area control unit 113 instructs the sound source separation unit 112 to perform sound source separation for each of the merged areas in the frames recorded in the separation area control list. Once the separation has been performed, the frame and areas are deleted from the list.
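The separation area control list can be pictured as a small bookkeeping structure that maps frames to the IDs of their merged areas. The sketch below is illustrative only: the class name `SeparationAreaControlList`, its method names, the numeric load value, and the `separate` callback are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class SeparationAreaControlList:
    # Maps a frame number to the set of area IDs that were merged in that frame.
    entries: dict = field(default_factory=dict)

    def record(self, frame, area_ids):
        """Remember that these areas were merged in the given frame."""
        self.entries.setdefault(frame, set()).update(area_ids)

    def process_pending(self, load, load_threshold, separate):
        """While the load is low, re-separate recorded areas and drop them.

        separate(frame, area_id) is a caller-supplied function standing in
        for the instruction sent to the sound source separation unit.
        """
        done = []
        if load >= load_threshold:
            return done  # still busy, so keep everything on the list
        for frame in sorted(self.entries):
            for area_id in sorted(self.entries[frame]):
                separate(frame, area_id)
                done.append((frame, area_id))
        self.entries.clear()
        return done
```

For example, after `record(5, [1, 2, 3, 4])`, a call to `process_pending` with a load below the threshold invokes `separate` once per area and leaves the list empty.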

The audio signal processing unit 114 processes the audio signal of each frame and area. The processing includes, for example, delay correction and gain correction that compensate for the effect of the distance between the area and the sound collection device, and echo cancellation.
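The delay and gain corrections mentioned above could, under simple assumptions, look like the following sketch. It assumes free-field propagation at 343 m/s, inverse-distance (1/r) attenuation, and whole-sample delays; the function name `correct_for_distance` and the `reference_m` parameter are hypothetical, and the patent does not state which correction formulas are used.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed


def correct_for_distance(signal, distance_m, sample_rate, reference_m=1.0):
    """Undo propagation delay and 1/r attenuation for one area signal.

    signal: list of samples from an area distance_m metres from the array.
    The output has the same length; samples shifted in at the end are zeros.
    """
    delay = round(distance_m / SPEED_OF_SOUND * sample_rate)
    gain = distance_m / reference_m  # compensates inverse-distance attenuation
    advanced = signal[delay:] + [0.0] * min(delay, len(signal))
    return [gain * s for s in advanced]
```

After this correction, signals from near and far areas are aligned in time and scaled as if each had been captured at the reference distance.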

The storage unit 115 is a storage device such as an HDD (hard disk drive), an SSD (solid-state drive), or memory. The storage unit 115 records, together with time information, the signals of all audio channels of frames subjected to separation area control in the sound source separation unit 112, and the signals processed by the audio signal processing unit 114.

The real-time reproduction signal generation unit 116 generates and outputs a real-time reproduction signal by mixing the per-area sound obtained from the sound source separation unit 112 within a predetermined time after sound collection. For example, it acquires from outside a virtual listening point and a virtual listener orientation in the space, which change over time (hereinafter simply "listening point" and "listener orientation"), together with information on the reproduction environment, and mixes the sound sources accordingly. Here, the reproduction environment refers to the configuration of the playback device that reproduces the signal generated by the real-time reproduction signal generation unit 116, for example whether it is loudspeakers (stereo, surround, or other multi-channel) or headphones. In other words, mixing combines and converts the audio signals of the divided areas to suit the environment, such as the number of channels of the playback device.
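As an illustration of mixing per-area signals for a listening point and listener orientation, the sketch below renders to two-channel stereo. It is a simplified assumption rather than the patent's mixing method: the function name `mix_stereo`, the constant-power pan law, the 1/distance gain clamped at 1 m, and the 2-D coordinates are all illustrative, and front and back are not distinguished.

```python
import math


def mix_stereo(area_signals, area_centers, listener_pos, listener_yaw):
    """Mix per-area signals into (left, right) for a virtual listener.

    listener_yaw: facing direction in radians, measured counterclockwise
    from the +x axis. Areas on the listener's left have positive relative angle.
    """
    length = len(area_signals[0])
    left = [0.0] * length
    right = [0.0] * length
    for sig, center in zip(area_signals, area_centers):
        dx = center[0] - listener_pos[0]
        dy = center[1] - listener_pos[1]
        dist = max(math.hypot(dx, dy), 1.0)  # clamp so nearby areas do not blow up
        rel = math.atan2(dy, dx) - listener_yaw
        pan = -math.sin(rel)                 # -1 is fully left, +1 is fully right
        theta = (pan + 1.0) * math.pi / 4.0  # constant-power pan law
        gain_l = math.cos(theta) / dist
        gain_r = math.sin(theta) / dist
        for n in range(length):
            left[n] += gain_l * sig[n]
            right[n] += gain_r * sig[n]
    return left, right
```

A surround or headphone (binaural) reproduction environment would replace the two pan gains with per-speaker gains or head-related filters, while the per-area loop would stay the same.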

When replay reproduction is requested, the replay reproduction signal generation unit 117 acquires the data for the corresponding time from the storage unit 115, performs the same processing as the real-time reproduction signal generation unit 116, and outputs the result.

FIG. 4 is a block diagram showing an example hardware configuration of the audio signal processing device 100. The audio signal processing device 100 is realized by, for example, a personal computer (PC), an embedded system, a tablet terminal, or a smartphone.

In FIG. 4, the CPU 990 is a central processing unit that cooperates with the other components based on a computer program and controls the overall operation of the audio signal processing device 100. The ROM 991 is a read-only memory that stores basic programs, data used for basic processing, and the like. The RAM 992 is a writable memory that functions as a work area and the like for the CPU 990.

The external storage drive 993 provides access to recording media and can load computer programs and data stored on a medium (recording medium) 994, such as a USB memory, into the system. The storage 995 is a device that functions as large-capacity memory, such as an SSD (solid-state drive). Various computer programs and data are stored in the storage 995.

The operation unit 996 is a device that accepts instructions and commands from the user, such as a keyboard, a pointing device, or a touch panel. The display 997 is a display device that shows commands input from the operation unit 996, the responses of the audio signal processing device 100 to them, and the like. The interface (I/F) 998 is a device that relays data exchanged with external devices. The microphone array 111 is connected to the audio signal processing device 100 via the interface 998. The system bus 999 is a data bus that carries data within the audio signal processing device 100.

Each functional element in FIG. 1 is realized by the CPU 990 controlling the entire device based on a computer program. Software that realizes functions equivalent to those of the above devices may also be used in place of the hardware devices.

(Processing procedure)
Next, the procedure of the processing executed by the audio signal processing device 100 is described with reference to FIG. 5. FIGS. 5(A) to 5(C) are flowcharts showing the procedure of the processing executed by the audio signal processing device 100 of this embodiment.

FIG. 5(A) shows the flow from sound collection to generation of the real-time reproduction signal. First, the microphone array 111 collects the sound in the space (S111). The collected audio signal of each channel is output to the sound source separation unit 112.

Next, the separation area control unit 113 determines, in terms of processing load, whether sound source separation will be in time for real-time reproduction (S112). As described with reference to FIG. 3, this processing is performed based on, for example, whether sound of a predetermined level is present.

When it is determined that processing will not be in time for real-time reproduction (NO in S112), the separation area control unit 113 controls the number of areas so that there are fewer sound source separation areas (S113). Specifically, for example, it merges separation areas of low importance, such as areas in which no sound at or above a certain level is detected, to reduce the number of separation areas. It then outputs information on the areas to be used for separation to the sound source separation unit 112. The separation area control unit 113 also creates the separation area control list.

Next, the storage unit 115 records the audio signal of the frame subjected to separation area control (S114).

When it is determined that processing will be in time for real-time reproduction, or after recording in S114, the sound source separation unit 112 performs sound source separation (S115). That is, the sound of each separation area is synthesized from the multi-channel signal collected in S111. As described above, the sound of a separation area can be reproduced by applying phase control and weighting to the audio signals collected by the microphones, based on the relationship between the microphones of the microphone array 111 and the position of the separation area, and then summing them. The separated audio signal of each area is output to the audio signal processing unit 114.

Next, the audio signal processing unit 114 processes the audio signal of each separation area (S116). As described above, this processing includes, for example, delay correction and gain correction that compensate for the effect of the distance between the separation area and the sound collection device, and noise processing by echo cancellation. The processed audio signals are output to the real-time reproduction signal generation unit 116 and the storage unit 115.

Next, the real-time reproduction signal generation unit 116 mixes the sound for real-time reproduction (S117). In mixing, the signals are combined and converted so that they can be reproduced according to the specifications of the playback device (for example, the number of channels). The sound mixed for real-time reproduction is output to an external playback device or as a broadcast signal.

Next, the storage unit 115 records the sound of each area (S118). The audio signal for replay reproduction is created from the per-area sound held in the storage unit 115.
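The flow of FIG. 5(A) after sound collection (S112 to S118) can be summarized as the control-flow sketch below. The function name `process_frame`, the dictionary-based storage, and the caller-supplied `separate`, `process`, and `mix` callables are hypothetical stand-ins for the units described above, and the area identifiers are assumed to be hashable values such as strings.

```python
def process_frame(frame_no, raw, can_keep_up, fine_areas, coarse_areas,
                  separate, process, mix, storage, control_list):
    """Run one frame through steps S112 to S118 and return the mixed output."""
    if can_keep_up:                                   # S112: in time for real time?
        areas = fine_areas
    else:
        areas = coarse_areas                          # S113: use fewer, merged areas
        control_list.append((frame_no, list(coarse_areas)))
        storage["raw"][frame_no] = raw                # S114: keep multi-channel input
    separated = {a: separate(raw, a) for a in areas}  # S115: sound source separation
    processed = {a: process(s) for a, s in separated.items()}  # S116
    output = mix(processed)                           # S117: real-time mix
    storage["areas"][frame_no] = processed            # S118: record per-area sound
    return output
```

Keeping the raw multi-channel frame in S114 is what makes the later, finer re-separation possible once the load drops.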

Next, the processing performed for frames for which processing was not in time for real-time reproduction in S112 of FIG. 5(A) (NO in S112) is described with reference to FIG. 5(B).

When the load on the processing device is lower than a predetermined value, the separation area control unit 113 reads data from the storage unit 115 based on the separation area control list (S121).

Next, for the areas listed in the separation area control list, that is, areas that were merged before sound source separation, sound source separation is performed again on the original, unmerged areas (S122). The resulting audio signals are output to the audio signal processing unit 114. The corresponding frame and areas are deleted from the separation area control list once processing is complete. S123 is the same as S116, so a detailed description is omitted.

Next, the storage unit 115 records the input per-area audio signals, overwriting the previous data (S124).

Next, the processing flow when a replay is requested is described with reference to FIG. 5(C). When a replay is requested, the replay reproduction signal generation unit 117 reads the per-area audio signals corresponding to the replay time from the storage unit 115 (S131).

Next, the replay reproduction signal generation unit 117 mixes the sound for replay reproduction (S132). The sound mixed for replay reproduction is output to an external playback device or as a broadcast signal.

As described above, the separation areas are controlled according to the processing load. That is, within a given space, control is performed so that a region where the load of at least one of sound source separation and reproduction signal generation is larger is divided into finer separation areas. As a result, the degree of separation drops for areas whose volume level is below the predetermined value, but areas whose volume level is at or above the predetermined value are processed at high resolution in time for real-time reproduction signal generation. Moreover, by separating the merged areas when the processing load is light, data of sufficient resolution can be obtained for replay.

In this embodiment, an example in which the microphone array 111 consists of microphones has been described, but the array may be combined with a structure such as a reflector. The microphones used in the microphone array 111 may be omnidirectional, directional, or a mixture of both.

In this embodiment, an example in which the sound source separation unit 112 uses beamforming to collect the sound of each area has been described, but other sound source separation methods may be used. For example, the power spectral density (PSD) of each area may be estimated, and separation may be performed with a Wiener filter based on the estimated PSD.
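For the Wiener-filter alternative, the standard per-frequency gain is the ratio of the target PSD to the total PSD. The sketch below assumes the PSD of the target area and the PSD of everything else have already been estimated for each frequency bin; the function names `wiener_gains` and `apply_gains` and the small `floor` value are illustrative, and PSD estimation itself is outside the scope of the sketch.

```python
def wiener_gains(target_psd, interference_psd, floor=1e-12):
    """Per-bin Wiener gain: target power divided by total power."""
    return [t / max(t + i, floor) for t, i in zip(target_psd, interference_psd)]


def apply_gains(spectrum, gains):
    """Scale each (possibly complex) frequency bin by its gain."""
    return [g * x for g, x in zip(gains, spectrum)]
```

Bins dominated by the target area pass almost unchanged (gain near 1), and bins dominated by other areas are suppressed (gain near 0).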

In this embodiment, an example in which the separation area control unit 113 controls the separation areas according to whether the sound level of an area is at or above a predetermined value has been described, but other criteria may be used. For example, even when sound is used as the basis, the device may be configured to detect acoustic features rather than the level and to judge whether those features are present. Specifically, when feature analysis detects sound with predetermined characteristics, such as a scream, a gunshot, the sound of a ball, or the sound of a car, the separation areas may be made smaller so that the sound is reproduced in detail. Alternatively, a space including all the areas may be filmed, and the separation areas may be controlled based on the captured video. For example, a specific subject such as a person, an animal, or a marker may be detected in the video, and control may be performed so that the separation areas around that subject become smaller.

In live broadcasts such as television broadcasts, systems are generally known that broadcast with a fixed delay of several seconds to several minutes from the actual shooting in order to adjust timing and to handle unforeseen events. When such a system is used, the separation area control unit 113 may control the separation order according to events contained in the video and audio within the delay period. For example, if a live sports broadcast has a two-minute delay, the separation areas may be set based on how the game developed over those two minutes before sound source separation is performed. For example, when a goal is scored in a game such as soccer, the movements of the scoring player and the ball may be detected from the two minutes of video, and the separation areas around their trajectories may be set finer. Conversely, areas that neither the players nor the ball enter may be set coarser.

Although in the present embodiment the separation area control unit 113 reduces the number of areas as much as possible, the number of areas may instead be calculated according to the processing load so that the number of areas is reduced only by the minimum amount necessary.

Although in the present embodiment the separation area control unit 113 controls the separation areas using the sound level of the previous frame, it may instead use information from the frame being processed. That is, if the sound level of a separated area is equal to or higher than a predetermined value, the separation area control unit 113 instructs the sound source separation unit 112 to perform sound source separation on areas obtained by dividing that area further. The separation area control unit 113 and the sound source separation unit 112 repeat this process until the areas shrink to a predetermined size. In this way, separation area control does not lag by one frame. However, because the processing amount of this method grows with the number of sound sources, it is advisable either to use it in scenes known in advance to contain few sound sources or to limit the number of repetitions to what the processing load allows.
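The repeated subdivision described above can be sketched as a recursive four-way split. This is only an illustration under stated assumptions: areas are axis-aligned rectangles `(x, y, width, height)`, `level_of` is a hypothetical callback standing in for "separate this area and measure its sound level", and `max_depth` is an assumed cap corresponding to the suggested limit on the number of repetitions.

```python
def refine_areas(area, level_of, threshold, min_size, max_depth=4):
    """Split an area into quadrants while its level stays at or above threshold.

    Recursion stops when the level is below the threshold, when the area
    has reached min_size, or when max_depth repetitions have been used.
    Returns the list of leaf areas.
    """
    x, y, w, h = area
    if level_of(area) < threshold or max(w, h) <= min_size or max_depth == 0:
        return [area]
    hw, hh = w / 2, h / 2
    leaves = []
    for cx, cy in ((x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)):
        leaves.extend(refine_areas((cx, cy, hw, hh), level_of,
                                   threshold, min_size, max_depth - 1))
    return leaves


# Stand-in level function: 1.0 if the area contains a source at (1, 1).
def toy_level(area):
    x, y, w, h = area
    return 1.0 if (x <= 1 < x + w and y <= 1 < y + h) else 0.0


leaves = refine_areas((0, 0, 8, 8), toy_level, threshold=0.5, min_size=2)
```

With a single source, only the quadrants containing it are split further, so the number of separations grows with the number of sources rather than with the size of the space.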

Although in the present embodiment the audio signal processing unit 114 performs delay correction, gain correction, and echo cancellation, it may also perform other processing, such as noise removal for each area.

The present embodiment describes an example in which the replay reproduction signal generation unit 117 and the real-time reproduction signal generation unit 116 perform the same processing. However, the two units may mix differently. For example, because the real-time reproduction signal generation unit 116 may receive sound from coarse separation areas, it may, depending on whether the processing has already been performed, lower the mixing level of areas with a large area size.
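One way to picture the size-dependent mixing is the sketch below. It assumes that each area signal is a list of samples of equal length, that an area counts as coarse when its size exceeds `fine_size`, and that coarse areas are attenuated by a fixed `coarse_gain`. The names and the value 0.5 are illustrative assumptions; the description does not specify how much the level is lowered.

```python
def mix_areas(area_signals, area_sizes, fine_size, coarse_gain=0.5):
    """Sum area signals, attenuating areas larger than fine_size."""
    out = [0.0] * len(area_signals[0])
    for signal, size in zip(area_signals, area_sizes):
        gain = coarse_gain if size > fine_size else 1.0
        for i, sample in enumerate(signal):
            out[i] += gain * sample
    return out


# Area 0 is fine (size 1), area 1 is coarse (size 4).
mixed = mix_areas([[1.0, 1.0], [2.0, 2.0]], [1, 4], fine_size=2)
```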

Although not shown in the present embodiment, display control may be performed to show the status of area control on a display device, as shown in FIG. 6. For example, the display screen shows a time bar 501, a time cursor 502, an area division display 503, an area division ratio display 504, and the like. The time bar 501 represents the recording time up to the present, and the position of the time cursor 502 represents the time shown on the display screen. The area division display 503 shows the state of area division at the time indicated by the time cursor 502. The image showing this state may be superimposed on an image of the actual space, on CG reproducing the actual space, or the like. The area division ratio display 504 shows the ratio of each area size. Alternatively, a screen such as that of FIG. 3 may be displayed. Displaying the information in this way makes the state of area division intuitively easy to understand. The display device may further include an input device such as a touch panel. For example, the user may select an area whose size has become large, by touch or the like, so that the process of dividing that area more finely is given priority.
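The values behind an area division ratio display could be computed as in the following sketch. It assumes that the ratio is taken over the number of areas of each size at the selected time; the description does not say whether the ratio is by count or by covered surface, so this choice and the function name are assumptions.

```python
from collections import Counter


def division_ratios(area_sizes):
    """Return {size: fraction of areas having that size}."""
    counts = Counter(area_sizes)
    total = len(area_sizes)
    return {size: count / total for size, count in sorted(counts.items())}


ratios = division_ratios([1, 1, 2, 4])
```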

<Embodiment 2>
The second embodiment (Embodiment 2) of the present invention relates to an acoustic system in which a plurality of users each set a listening point and sound corresponding to each listening point is reproduced by a reproduction device.

(Acoustic system)
FIG. 7 is a block diagram showing the configuration of the acoustic system 20. The acoustic system 20 includes a sound collecting unit 21, a reproduction signal generation unit 22, and a plurality of reproduction units 23. The sound collecting unit 21, the reproduction signal generation unit 22, and the reproduction units 23 exchange data with one another over wired or wireless transmission paths. The transmission paths between them are implemented by a dedicated communication path such as a LAN, but may pass through a public communication network such as the Internet.

FIG. 8(A) is a block diagram showing the configuration of the sound collecting unit 21, FIG. 8(B) is a block diagram showing the configuration of the reproduction signal generation unit 22, and FIG. 8(C) is a block diagram showing the configuration of the reproduction unit 23. The sound collecting unit 21 of FIG. 8(A) includes the microphone array 111 and a sound collection signal transmission unit 211. The microphone array 111 is the same as in Embodiment 1, so a detailed description is omitted. The sound collection signal transmission unit 211 transmits the microphone signals input from the microphone array 111.

The reproduction signal generation unit 22 of FIG. 8(B) includes the sound source separation unit 112, the separation area control unit 113, the audio signal processing unit 114, the storage unit 115, a sound collection signal receiving unit 221, a listening point receiving unit 222, a reproduction-use signal generation unit 223, and a reproduction signal transmission unit 224. The sound source separation unit 112, the audio signal processing unit 114, and the storage unit 115 are substantially the same as in Embodiment 1, so a detailed description is omitted.

The separation area control unit 113 controls the areas in which the sound source separation unit 112 performs sound source separation, based on a plurality of listening points input from the listening point receiving unit 222 described later. Here, a listening point is information set by a user that consists of the position and orientation of a virtual listener in the space, and a time. For example, the separation area control unit 113 monitors the processing load of the reproduction signal generation unit 22 and, when the load becomes large, controls the areas so as to reduce the number of separation areas based on the distribution of the listening points. Suppose, for example, that the listener positions set by users listening in real time are distributed as shown in FIG. 9(A). In that case, as shown in FIG. 9(B), the areas are controlled so that the surroundings of areas where many listening points are set are divided finely and areas with few listening points are divided coarsely.
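The density-based control can be sketched as follows. The sketch assumes a square space divided into square blocks of side `block_size`; a block holding at least `min_points` listener positions is split into four finer areas, and every other block is kept as one coarse area. The two-level grid, the parameter names, and the threshold rule are illustrative assumptions and are not the layout of FIG. 9.

```python
from collections import Counter


def plan_areas(listener_positions, space_size, block_size, min_points):
    """Return (x, y, size) areas: fine where listeners cluster, coarse elsewhere."""
    counts = Counter((int(px // block_size), int(py // block_size))
                     for px, py in listener_positions)
    blocks_per_side = int(space_size // block_size)
    areas = []
    for by in range(blocks_per_side):
        for bx in range(blocks_per_side):
            x0, y0 = bx * block_size, by * block_size
            if counts[(bx, by)] >= min_points:
                half = block_size / 2
                areas += [(x0 + dx, y0 + dy, half)
                          for dy in (0, half) for dx in (0, half)]
            else:
                areas.append((x0, y0, block_size))
    return areas


# Three listeners cluster in the first block; one listener is alone in the last.
areas = plan_areas([(1, 1), (2, 3), (3, 2), (9, 9)],
                   space_size=16, block_size=8, min_points=2)
```

Raising `min_points` as the processing load grows would reduce the total number of areas while keeping fine areas where most listeners are.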

When a user designates a listening point at a past time, that is, when a replay is requested, whether sound source separation is needed is determined based on the state of the separation areas at that time and the designated listening point, and, if needed, sound source separation is performed according to the processing load. For example, if no area control was performed at the designated time, or if area control was performed but the surroundings of the listening point designated this time were already separated in sufficiently fine areas, there is no need to perform separation again. On the other hand, if area control was performed at the designated time and the areas around the listening point designated this time are coarsely divided, the separation area control unit 113 outputs a control signal to the sound source separation unit 112 so that the areas around the listening point are divided more finely.
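The replay decision can be written as a small predicate. The sketch assumes that the separation area control list is a dictionary from frame index to the rectangles `(x, y, width, height)` used at that frame, holding entries only for frames where area control took place. It also simplifies "around the listening point" to the single area that contains the listener position. Both the data layout and that simplification are assumptions.

```python
def needs_reseparation(control_list, frame, listener_pos, fine_size):
    """Return True if the replayed frame must be separated again."""
    areas = control_list.get(frame)
    if areas is None:
        return False  # no area control at that time: stored areas are already fine
    px, py = listener_pos
    for x, y, w, h in areas:
        if x <= px < x + w and y <= py < y + h:
            # Re-separate only if the area holding the listener is coarse.
            return max(w, h) > fine_size
    return False


control_list = {10: [(0, 0, 8, 8), (8, 0, 2, 2)]}
coarse_hit = needs_reseparation(control_list, 10, (1, 1), fine_size=2)
fine_hit = needs_reseparation(control_list, 10, (9, 1), fine_size=2)
```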

The sound collection signal receiving unit 221 receives the collected sound signal from the sound collecting unit 21. The listening point receiving unit 222 receives a listening point from each of the plurality of reproduction units 23 and outputs the received listening points to the separation area control unit 113 and the reproduction-use signal generation unit 223. The reproduction-use signal generation unit 223 combines the functions of the real-time reproduction signal generation unit 116 and the replay reproduction signal generation unit 117 of Embodiment 1. It generates a reproduction signal according to the listener position, orientation, and time input from the listening point receiving unit 222. If the input time is the current time, it operates in the same way as the real-time reproduction signal generation unit 116; if the time is in the past, it operates in the same way as the replay reproduction signal generation unit 117. The audio signal generated for each listening point is output to the reproduction signal transmission unit 224, which outputs each received audio signal to the corresponding reproduction unit 23.

The reproduction unit 23 of FIG. 8(C) includes a listening point input unit 231, a listening point transmission unit 232, a reproduction signal receiving unit 233, and a speaker 234. The listening point input unit 231 is an input device with which the user can set a time and the position and orientation of a virtual listener in the space where sound is being collected. The listening point input unit 231 is implemented by a keyboard, a pointing device, a touch panel, or the like. The set listening point is output to the listening point transmission unit 232.

The listening point transmission unit 232 outputs the listening point set by the user to the listening point receiving unit 222. The reproduction signal receiving unit 233 receives the audio signal corresponding to the listening point set with the listening point input unit 231 and outputs it to the speaker 234. The speaker 234 performs D/A conversion on the input audio signal and emits the sound.

(Processing procedure)
Next, the procedure of the processing executed by the acoustic system 20 will be described with reference to FIG. 10. FIGS. 10A to 10C are flowcharts showing the procedure of the processing executed by the acoustic system 20 of the present embodiment.

As shown in FIG. 10A, the microphone array 111 first collects the sound in the space (S201). The collected sound is output to the sound collection signal transmission unit 211. The collected sound signal is then transmitted from the sound collection signal transmission unit 211 of the sound collecting unit 21 and received by the sound collection signal receiving unit 221 of the reproduction signal generation unit 22 (S202). The received signal is output to the sound source separation unit 112. Next, listening points are input at the listening point input units 231 of the plurality of reproduction units 23 (S203). Each input listening point is output to the listening point transmission unit 232.

The listening points are then transmitted from the listening point transmission units 232 and received by the listening point receiving unit 222 of the reproduction signal generation unit 22 (S204). The received listening points are output to the separation area control unit 113 and the reproduction-use signal generation unit 223.

Next, the separation area control unit 113 determines whether the processing will be completed in time for real-time reproduction (S205). If it is determined that the processing will be in time (YES in S205), the process proceeds to S208; if not (NO in S205), the process proceeds to S206.

In S206, the separation area control unit 113 controls the separation areas. That is, it outputs to the sound source separation unit 112 a control signal that combines a plurality of areas to reduce the number of areas. It also generates a separation area control list and manages the control information of the separation areas. When the areas have been controlled, the sound source separation unit 112 outputs the collected sound signal of that frame to the storage unit 115, which records the input signal (S207). The process then proceeds to S208.
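Steps S205 to S207 can be sketched as below. The sketch makes several assumptions that are not in the description: the real-time check is modeled as "number of areas within `budget`", areas are plain identifiers, merging is a toy grouping of consecutive areas, and the separation area control list is a list of dictionaries with hypothetical keys `frame` and `fine_areas`.

```python
def merge_to_budget(areas, budget):
    """Toy merge: group consecutive areas so that at most `budget` groups remain."""
    group = -(-len(areas) // budget)  # ceiling division
    return [tuple(areas[i:i + group]) for i in range(0, len(areas), group)]


def prepare_frame(frame_index, raw_frame, areas, budget, control_list, storage):
    """Return the (possibly merged) areas to separate in real time."""
    if len(areas) <= budget:                       # S205: YES, in time
        return [(a,) for a in areas]
    merged = merge_to_budget(areas, budget)        # S206: reduce the area count
    control_list.append({"frame": frame_index, "fine_areas": list(areas)})
    storage[frame_index] = raw_frame               # S207: keep the raw signal
    return merged


control_list, storage = [], {}
merged = prepare_frame(7, [0.1, 0.2], ["a", "b", "c", "d", "e"], 2,
                       control_list, storage)
```

Keeping the raw frame only when areas were merged limits storage to the frames that may need finer separation later.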

In S208, the sound source separation unit 112 performs sound source separation for each area. The separated audio signal of each area is output to the audio signal processing unit 114.

Next, the audio signal processing unit 114 processes the audio signals (S209). The processed audio signals are output to the storage unit 115.

Next, the storage unit 115 records the processed audio signal of each area (S210). The reproduction-use signal generation unit 223 then acquires the sound of each area from the storage unit 115 according to the times of the plurality of listening points input from the listening point receiving unit 222, and mixes the sound for reproduction for each listening point (S211). The mixed reproduction signals are output to the reproduction signal transmission unit 224.

The reproduction signals generated for the individual listening points are then transmitted from the reproduction signal transmission unit 224, and the reproduction signal receiving unit 233 of each reproduction unit 23 receives the reproduction signal corresponding to the listening point it input (S212). Finally, the reproduction signal received by the reproduction signal receiving unit 233 is reproduced from the speaker 234 (S213).

Next, with reference to FIG. 10(B), the processing performed after it has been determined in S205 of FIG. 10(A) that the processing would not be in time, and the number of areas has therefore been reduced, will be described.

When the processing load falls below a predetermined value, the separation area control unit 113 refers to the separation area control list and determines the time (frame) and the areas to be separated (S221). The information on the areas and time to be separated is output to the sound source separation unit 112.

Next, the sound source separation unit 112 reads the collected sound signal from the storage unit 115 based on the input time information (S222). S223 to S225 are the same as S208 to S210, so a detailed description is omitted.
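The deferred pass of S221 to S223 can be sketched as a function that runs only when the load is low. The sketch assumes a control list of dictionaries with hypothetical keys `frame` and `fine_areas`, a `storage` dictionary holding the raw frames, a numeric `load` compared against `load_threshold`, and a caller-supplied `separate` function. None of these names come from the description.

```python
def run_deferred(control_list, storage, load, load_threshold, separate):
    """Process one pending frame if the load is below the threshold."""
    if load >= load_threshold or not control_list:
        return None
    entry = control_list.pop(0)                    # S221: pick frame and areas
    raw_frame = storage[entry["frame"]]            # S222: read the stored signal
    return entry["frame"], separate(raw_frame, entry["fine_areas"])  # S223


# Stand-in separation: hand every requested area the raw frame unchanged.
def toy_separate(raw_frame, areas):
    return {area: raw_frame for area in areas}


pending = [{"frame": 7, "fine_areas": ["a", "b"]}]
stored = {7: [0.5]}
skipped = run_deferred(pending, stored, 0.9, 0.5, toy_separate)
done = run_deferred(pending, stored, 0.2, 0.5, toy_separate)
```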

As described above, the separation areas are combined based on the processing load and the distribution of the plurality of listening points, reducing the number of areas. As a result, important audio signals can be reproduced faithfully while the processing is made efficient enough for real-time operation. In addition, at replay time, a reproduction signal can be generated using separated sound even for areas for which transmission was not in time during real-time reproduction.

In the present embodiment, all the reproduction units 23 have the same configuration for simplicity, but their configurations may differ. Although not described in the present embodiment, the system may be used in combination with a free viewpoint video generation system. For example, a plurality of imaging devices capture, from all directions, substantially the same space as the one in which sound is collected, and a free viewpoint video is generated from the captured images. In that case, the listening point may be calculated from the viewpoint, or the free viewpoint video may be generated in conjunction with the listening point.

In the present embodiment, the reproduction-use signal generation unit 223 is provided in the reproduction signal generation unit 22, but it may instead be provided in the reproduction unit 23. Also, in the present embodiment the separation area control unit 113 determines the separation areas using only the positions of the plurality of listeners. However, as shown in FIG. 9(C), the region in front of each listener, in the direction the listener faces, may be divided finely and the region behind the listener divided coarsely.
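The orientation-dependent division can be illustrated with a front/back test. The sketch assumes a two-dimensional space, a listener orientation given as an angle in radians measured from the positive x axis, and a simple half-plane rule: an area counts as "in front" when the vector from the listener to the area center has a positive projection on the facing direction. The function names and the half-plane rule are assumptions, not the division shown in FIG. 9(C).

```python
import math


def is_in_front(listener_pos, listener_angle, area_center):
    """True if area_center lies in the half-plane the listener faces."""
    dx = area_center[0] - listener_pos[0]
    dy = area_center[1] - listener_pos[1]
    return dx * math.cos(listener_angle) + dy * math.sin(listener_angle) > 0.0


def area_size_for(listener_pos, listener_angle, area_center, fine, coarse):
    """Choose the fine size in front of the listener and the coarse size behind."""
    if is_in_front(listener_pos, listener_angle, area_center):
        return fine
    return coarse


# Listener at the origin facing +x.
front = area_size_for((0.0, 0.0), 0.0, (1.0, 0.0), fine=1, coarse=4)
back = area_size_for((0.0, 0.0), 0.0, (-1.0, 0.0), fine=1, coarse=4)
```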

When area control is performed in the present embodiment, the listening positions that can be input at the listening point input unit 231 may be restricted. Although the reproduction units 23 are treated uniformly in the present embodiment, the listening points may be given different weights for the purpose of controlling the separation areas. As in Embodiment 1, a display device that shows the status of area control and an input device for instructing separation area control may also be provided.

In each embodiment of the present invention, even in real-time reproduction where the time available before reproduction is limited, controlling the number of areas subjected to sound source separation makes it possible to collect sound from the entire space and reproduce it while maintaining the resolution of important areas.

<Other Embodiments>
The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiments is supplied to a system or apparatus via a network or a storage medium, and one or more processors in a computer of the system or apparatus read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

100: audio signal processing apparatus, 111: microphone array, 112: sound source separation unit, 113: separation area control unit, 114: audio signal processing unit, 115: storage unit, 116: real-time reproduction signal generation unit, 117: replay reproduction signal generation unit

Claims

A sound processing apparatus comprising:
acquisition means for acquiring a collected sound signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining positions and sizes of a plurality of partial areas in the sound collection region based on a position of an object detected in the sound collection region;
extraction means for extracting, from the collected sound signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means; and
generation means for generating a reproduction sound signal according to a position and orientation of a designated virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means,
wherein the determination means determines the sizes of the plurality of partial areas so that a partial area that includes the position of an object detected in the sound collection region is smaller than a partial area that does not include the position of an object detected in the sound collection region.

A sound processing apparatus comprising:
acquisition means for acquiring a collected sound signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining positions and sizes of a plurality of partial areas in the sound collection region based on a position of an object detected in the sound collection region;
extraction means for extracting, from the collected sound signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means; and
generation means for generating a reproduction sound signal according to a position and orientation of a designated virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means,
wherein the determination means determines the sizes of the plurality of partial areas so that a partial area that includes a first number of objects detected in the sound collection region is smaller than a partial area that includes a second number of objects detected in the sound collection region, the second number being smaller than the first number.

A sound processing apparatus comprising:
acquisition means for acquiring a collected sound signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining positions and sizes of a plurality of partial areas in the sound collection region based on a position of an object detected in the sound collection region;
extraction means for extracting, from the collected sound signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means; and
generation means for generating a reproduction sound signal according to a position and orientation of a designated virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means,
wherein the generation means generates the reproduction sound signal by synthesizing the two or more area sound signals based on the position and orientation of the virtual listening point.

A sound processing apparatus comprising:
acquisition means for acquiring a collected sound signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining positions and sizes of a plurality of partial areas in the sound collection region based on a position of an object detected in the sound collection region;
extraction means for extracting, from the collected sound signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means;
generation means for generating a reproduction sound signal according to a position and orientation of a designated virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means; and
display control means for causing a display unit to display an image showing an arrangement of the plurality of partial areas determined by the determination means.

The sound processing apparatus according to any one of claims 1 to 4, wherein the determination means determines the number of the plurality of partial areas included in the sound collection region based on the position of the object.

The sound processing apparatus according to any one of claims 1 to 5, wherein the determination means determines the number of the plurality of partial areas included in the sound collection region based on a processing load of at least one of the extraction means and the generation means.

The sound processing apparatus according to any one of claims 1 to 6, further comprising detection means for detecting the position of an object in the sound collection region based on the collected sound signal acquired by the acquisition means.

The sound processing apparatus according to any one of claims 1 to 7, further comprising detection means for detecting the position of an object in the sound collection region based on an image obtained by photographing the sound collection region.

A sound processing apparatus comprising:
acquisition means for acquiring a collected sound signal based on sound collection by a plurality of microphones that collect sound in a sound collection region;
determination means for determining positions and sizes of a plurality of partial areas in the sound collection region based on at least one of a position and an orientation of a designated virtual listening point;
extraction means for extracting, from the collected sound signal acquired by the acquisition means, a plurality of area sound signals respectively corresponding to the plurality of partial areas determined by the determination means; and
generation means for generating a reproduction sound signal according to the position and orientation of the virtual listening point by sound processing using two or more of the area sound signals extracted by the extraction means.

The sound processing apparatus according to claim 9, wherein the determination means determines the sizes of the plurality of partial areas so that a partial area that includes the position of the virtual listening point is smaller than a partial area that does not include the position of the virtual listening point.

The sound processing apparatus according to claim 9, wherein the determination means determines the sizes of the plurality of partial areas so that a partial area located in the direction in which the virtual listening point faces is smaller than a partial area not located in that direction.

A sound processing method comprising:
an acquisition step of acquiring a sound collection signal based on sound collection by a plurality of microphones that collect sound in a sound collection area;
a determination step of determining the positions and sizes of a plurality of partial areas within the sound collection area based on the position of an object detected in the sound collection area;
an extraction step of extracting, from the sound collection signal acquired in the acquisition step, a plurality of area acoustic signals respectively corresponding to the plurality of partial areas determined in the determination step; and
a generation step of generating a reproduction acoustic signal according to the position and orientation of a designated virtual listening point by acoustic processing using two or more of the area acoustic signals extracted in the extraction step,
wherein, in the determination step, the sizes of the plurality of partial areas are determined so that the size of a partial area including the position of the object detected in the sound collection area is smaller than the size of a partial area not including the position of the object detected in the sound collection area.
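As a rough illustration of the object-based determination step, the sketch below recursively subdivides any area that contains a detected object until a minimum side length is reached. The quadtree-style subdivision, the function name `subdivide`, and the `min_side` parameter are assumptions chosen for the example; the claim itself does not prescribe how the smaller areas are obtained.

```python
from typing import List, Tuple

# Hypothetical representation: a square partial area as (x_min, y_min, side).
Area = Tuple[float, float, float]
Point = Tuple[float, float]


def subdivide(area: Area, objects: List[Point], min_side: float) -> List[Area]:
    """Recursively split areas that contain a detected object position."""
    x, y, side = area
    has_object = any(x <= ox < x + side and y <= oy < y + side
                     for ox, oy in objects)
    # Stop when the area holds no object or splitting would go below min_side.
    if not has_object or side / 2 < min_side:
        return [area]
    h = side / 2
    result: List[Area] = []
    for child in [(x, y, h), (x + h, y, h), (x, y + h, h), (x + h, y + h, h)]:
        result.extend(subdivide(child, objects, min_side))
    return result


# One 16 m sound collection area with a single detected object near a corner.
areas = subdivide((0.0, 0.0, 16.0), objects=[(1.0, 1.0)], min_side=4.0)
```

With these inputs the area containing the object ends up with a 4 m side, while areas far from the object keep an 8 m side. Note that in this simple scheme the immediate neighbours of the object area share its size; a stricter reading of the claim would need a different partitioning rule.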

A sound processing method comprising:
an acquisition step of acquiring a sound collection signal based on sound collection by a plurality of microphones that collect sound in a sound collection area;
a determination step of determining the positions and sizes of a plurality of partial areas within the sound collection area based on at least one of the position and the orientation of a designated virtual listening point;
an extraction step of extracting, from the sound collection signal acquired in the acquisition step, a plurality of area acoustic signals respectively corresponding to the plurality of partial areas determined in the determination step; and
a generation step of generating a reproduction acoustic signal according to the position and orientation of the virtual listening point by acoustic processing using two or more of the area acoustic signals extracted in the extraction step.
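The generation step can be pictured as a mix of area acoustic signals weighted by where each area lies relative to the virtual listening point. The following sketch is one possible reading, not the claimed processing: the stereo output, the inverse-distance gain, the constant-power pan law, and the name `render_stereo` are all assumptions introduced here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def render_stereo(area_signals: Sequence[Sequence[float]],
                  area_centers: Sequence[Point],
                  pos: Point, yaw: float) -> Tuple[List[float], List[float]]:
    """Mix area signals into left/right channels for a listener at pos facing yaw."""
    n = len(area_signals[0])
    left = [0.0] * n
    right = [0.0] * n
    for signal, (cx, cy) in zip(area_signals, area_centers):
        dx, dy = cx - pos[0], cy - pos[1]
        # Assumed inverse-distance attenuation, clamped to avoid huge gains.
        gain = 1.0 / max(math.hypot(dx, dy), 1.0)
        # Angle of the area relative to the facing direction
        # (counter-clockwise, i.e. to the listener's left, is positive).
        rel = math.atan2(dy, dx) - yaw
        # Pan in [-1, 1]: -1 is fully left, +1 is fully right.
        pan = -math.sin(rel)
        theta = (pan + 1.0) * math.pi / 4.0  # constant-power pan law
        g_left, g_right = math.cos(theta) * gain, math.sin(theta) * gain
        for i, sample in enumerate(signal):
            left[i] += g_left * sample
            right[i] += g_right * sample
    return left, right


# Listener at the origin facing +x; one area 2 m to the left, one 2 m to the right.
left, right = render_stereo(
    area_signals=[[1.0, 0.0], [0.0, 1.0]],
    area_centers=[(0.0, 2.0), (0.0, -2.0)],
    pos=(0.0, 0.0), yaw=0.0)
```

Here the first area's sample appears only in the left channel and the second area's only in the right, each attenuated by the 2 m distance. Changing `yaw` rotates the listener and moves the same area signals between the channels.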

The sound processing method according to claim 13, wherein, in the determination step, the sizes of the plurality of partial areas are determined so that the size of a partial area including the position of the virtual listening point is smaller than the size of a partial area not including the position of the virtual listening point.

A computer program for causing a computer to function as each of the means included in the sound processing apparatus according to any one of claims 1 to 11.