WO2022034795A1 - Signal processing device and method, noise cancelling device, and program - Google Patents



Publication number
WO2022034795A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal processing
signal
speaker
spatial frequency
microphone
Application number
PCT/JP2021/027823
Other languages
French (fr)
Japanese (ja)
Inventor
徹徳 板橋
直毅 村田
悠 前野
Original Assignee
ソニーグループ株式会社
Application filed by ソニーグループ株式会社
Publication of WO2022034795A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10K: SOUND-PRODUCING DEVICES; METHODS OR DEVICES FOR PROTECTING AGAINST, OR FOR DAMPING, NOISE OR OTHER ACOUSTIC WAVES IN GENERAL; ACOUSTICS NOT OTHERWISE PROVIDED FOR
    • G10K11/00: Methods or devices for transmitting, conducting or directing sound in general; Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/16: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general
    • G10K11/175: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound
    • G10K11/178: Methods or devices for protecting against, or for damping, noise or other acoustic waves in general using interference effects; Masking sound by electro-acoustically regenerating the original acoustic waves in anti-phase
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R1/00: Details of transducers, loudspeakers or microphones
    • H04R1/20: Arrangements for obtaining desired frequency or directional characteristics
    • H04R1/32: Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R1/40: Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers

Definitions

  • The present technology relates to signal processing devices and methods, noise canceling devices, and programs, and in particular to signal processing devices and methods, noise canceling devices, and programs capable of reducing delay time.
  • spatial noise canceling (hereinafter, also referred to as spatial NC (Noise Canceling)), which performs noise canceling using wave field synthesis technology, is known.
  • In order to realize such spatial NC, a method of performing operations such as filtering in the spatial frequency domain can be considered.
  • A technique for performing operations in the spatial frequency domain has been proposed in order to realize wave field synthesis (see, for example, Non-Patent Document 1). If arithmetic in the spatial frequency domain is used, higher spatial NC performance can be realized by taking into account the correlation between the channels corresponding to the plurality of speakers.
  • In such a technique, however, a temporal Fourier transform must first be performed on the microphone signals obtained by all the microphones, and a spatial Fourier transform must then be performed on the resulting signals to convert them into signals in the spatial frequency domain, which increases the delay time.
  • This technology was made in view of such a situation, and makes it possible to reduce the delay time.
  • The signal processing device of the first aspect of the present technology has one or a plurality of signal processing units that perform signal processing in the spatial frequency domain, and the signal processing unit performs the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound collection by a plurality of microphones.
  • The signal processing method or program of the first aspect of the present technology is a signal processing method or program for a signal processing device having one or a plurality of signal processing units that perform signal processing in the spatial frequency domain, and includes a step in which the one or plurality of signal processing units perform the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound collection by a plurality of microphones.
  • In the first aspect of the present technology, the signal processing is performed by the one or plurality of signal processing units on a signal converted into the spatial frequency domain on the basis of the microphone signals obtained by sound collection by the plurality of microphones.
  • The noise canceling device of the second aspect of the present technology includes a plurality of microphones, one or a plurality of signal processing units that perform signal processing in the spatial frequency domain, and a plurality of speakers that output sound based on a noise canceling signal generated by the signal processing, and the signal processing unit performs the signal processing on a signal converted into the spatial frequency domain on the basis of the microphone signals obtained by sound collection by the plurality of microphones to generate the noise canceling signal.
  • In the second aspect of the present technology, a plurality of microphones, one or a plurality of signal processing units that perform signal processing in the spatial frequency domain, and a plurality of speakers that output sound based on the noise canceling signal generated by the signal processing are provided, and the signal processing unit performs the signal processing on a signal converted into the spatial frequency domain on the basis of the microphone signals obtained by sound collection by the plurality of microphones to generate the noise canceling signal.
  • The present technology realizes spatial NC that requires neither time-frequency conversion nor its inverse, by directly performing spatial frequency conversion on the time domain microphone signals and converting them into signals in the spatial frequency domain. As a result, the delay time can be reduced and higher spatial NC performance can be obtained.
  • In a general spatial frequency domain processing system, the temporal Fourier transform is performed on the microphone signals obtained by all the microphones, and the spatial Fourier transform is then performed on the resulting signals. Filtering is performed on the signals in the spatial frequency domain obtained by the spatial Fourier transform to generate speaker signals, and the inverse spatial Fourier transform and the inverse temporal Fourier transform are then applied to all the speaker signals to obtain speaker signals in the time domain.
  • the spatial Fourier transform, its inverse transform, and the filtering in the spatial frequency domain are performed using the signals obtained from the microphone signals of all the microphones.
  • Furthermore, in the present technology the microphones and speakers used for spatial NC are divided into a plurality of groups and processing is performed for each group, so that processing which, in general spatial NC, could only be performed by one device can be shared among a plurality of devices or arithmetic units. As a result, the amount of computation of each device or arithmetic unit can be reduced, and the number of input/output lines required for one device can also be reduced.
  • FIG. 1 is a diagram showing a configuration of a multi-input multi-output system that realizes noise canceling with general headphones or the like, that is, a parallel SISO (Single Input Single Output) system.
  • the parallel SISO system shown in FIG. 1 has microphones 11-1 to 11-6, SISO filters 12-1 to SISO filters 12-6, and speakers 13-1 to 13-6.
  • the microphones 11-1 to 11-6 pick up the ambient sound, and supply the resulting microphone signal to the SISO filter 12-1 to the SISO filter 12-6.
  • The SISO filters 12-1 to 12-6 filter the microphone signals supplied from the microphones 11-1 to 11-6 with time domain SISO filters, and supply the resulting speaker signals to the speakers 13-1 to 13-6.
  • Here, the speaker signal is a drive signal for outputting sound (hereinafter also referred to as noise canceling sound) from the speaker so that noise is canceled in a target area (position), that is, so that noise canceling is realized.
  • the speaker signal is a noise canceling signal for outputting a noise canceling sound from the speaker.
  • Speakers 13-1 to 13-6 output sound based on the speaker signals supplied from the SISO filters 12-1 to SISO filters 12-6, and realize noise canceling.
  • Hereinafter, when it is not necessary to distinguish the microphones 11-1 to 11-6, they are also simply referred to as microphones 11; when it is not necessary to distinguish the SISO filters 12-1 to 12-6, they are also simply referred to as SISO filters 12; and when it is not necessary to distinguish the speakers 13-1 to 13-6, they are also simply referred to as speakers 13.
  • A system consisting of, for example, the microphone 11-1, the SISO filter 12-1, and the speaker 13-1 is a SISO system corresponding to one channel, and the parallel SISO system is configured by arranging a plurality of such SISO systems in parallel.
  • Since each channel, that is, each SISO system, operates independently, the amount of computation in each of those SISO systems can be small.
  • On the other hand, since the correlation between channels is not taken into consideration, the higher the frequency, the greater the influence of phase shifts between the sounds output by the speakers 13, and the lower the noise canceling effect.
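  • The following is a minimal Python/NumPy sketch (not code from the patent) of the parallel SISO structure of FIG. 1, under the assumption that each SISO filter 12 is an FIR filter; filter design and real-time block handling are omitted.

```python
import numpy as np

def parallel_siso(mic_signals, siso_filters):
    """mic_signals: (M, N) array, one row per microphone 11.
    siso_filters: (M, Nf) array, one independent time-domain FIR filter per channel."""
    M, N = mic_signals.shape
    speaker_signals = np.zeros((M, N))
    for m in range(M):
        # Each channel is processed on its own; inter-channel correlation is ignored.
        speaker_signals[m] = np.convolve(mic_signals[m], siso_filters[m])[:N]
    return speaker_signals
```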
  • FIG. 2 is a diagram showing a configuration of a multi-point control MIMO (Multi Input Multi Output) system.
  • In FIG. 2, the same reference numerals are given to the portions corresponding to those in FIG. 1, and the description thereof will be omitted as appropriate.
  • the multipoint control MIMO system shown in FIG. 2 has microphones 11-1 to 11-6, a MIMO filter 41, and speakers 13-1 to 13-6.
  • the microphone signals obtained from all the microphones 11 are input to one MIMO filter 41.
  • The MIMO filter 41 performs MIMO filtering on the microphone signals supplied from the microphones 11 to generate a speaker signal for each channel, and outputs the speaker signal of each channel to the corresponding speaker 13.
  • Since filter calculation in the time domain is performed between every microphone 11 and every speaker 13, the correlation between channels is taken into consideration, and a good noise canceling effect can be obtained even at high frequencies.
  • However, the amount of computation in the MIMO filter 41 increases in proportion to the square of the number of channels, so the larger the number of channels, the larger the amount of computation. For example, when the number of channels is 48, the parallel SISO system only requires the amount of filtering computation for 48 channels, whereas the multipoint control MIMO system requires filtering computation for every combination of the 48 input channels and 48 output channels.
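  • For comparison, a minimal sketch (assumed FIR filters, not the patent's) of the multipoint control MIMO filtering of FIG. 2; every output channel depends on every input channel, which is why the cost grows with the square of the channel count.

```python
import numpy as np

def mimo_filter(mic_signals, mimo_filters):
    """mic_signals: (M, N); mimo_filters: (M, M, Nf), one FIR filter per
    (speaker, microphone) pair."""
    M, N = mic_signals.shape
    speaker_signals = np.zeros((M, N))
    for spk in range(M):
        for mic in range(M):
            # M x M convolutions in total, e.g. 48 x 48 when the channel count is 48.
            speaker_signals[spk] += np.convolve(mic_signals[mic],
                                                mimo_filters[spk, mic])[:N]
    return speaker_signals
```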
  • Therefore, it is known that the amount of computation can be significantly reduced by performing the processing in the spatial frequency domain, for example with the spatial frequency domain processing system shown in FIG. 3.
  • In FIG. 3, the same reference numerals are given to the portions corresponding to those in FIG. 1, and the description thereof will be omitted as appropriate.
  • The spatial frequency domain processing system shown in FIG. 3 has microphones 11-1 to 11-6, time FFT (Fast Fourier Transform) units 71-1 to 71-6, a spatial FFT unit 72, a filter processing unit 73, a spatial inverse FFT unit 74, time inverse FFT units 75-1 to 75-6, and speakers 13-1 to 13-6.
  • The time FFT units 71-1 to 71-6 perform time FFT processing on the microphone signals supplied from the microphones 11-1 to 11-6, and supply the resulting time-frequency domain signals to the spatial FFT unit 72.
  • The spatial FFT unit 72 performs spatial FFT processing on the signals supplied from the time FFT units 71-1 to 71-6, and supplies the resulting spatial frequency domain signals (one signal per frequency bin) to the filter processing unit 73.
  • Hereinafter, when it is not necessary to particularly distinguish the time FFT units 71-1 to 71-6, they are also simply referred to as time FFT units 71.
  • In the time FFT unit 71, for example, an STFT (Short Time Fourier Transform) or other time-axis FFT processing (temporal Fourier transform) is performed as the time FFT processing, and the microphone signal, which is a time signal, is converted into a signal in the time-frequency domain.
  • In the spatial FFT unit 72, FFT processing along the space axis (spatial Fourier transform) is performed as the spatial FFT processing on the time-frequency domain signals obtained by the time FFT units 71, yielding time-frequency signals in the spatial frequency domain.
  • the filter processing unit 73 filters the signal supplied from the spatial FFT unit 72 in the spatial frequency domain, and supplies the speaker signal for each frequency bin obtained as a result to the spatial inverse FFT unit 74.
  • The filtering in the filter processing unit 73 is a multiplication on the frequency axes, so the amount of computation can be greatly reduced compared with the multipoint control MIMO system of FIG. 2.
  • The spatial inverse FFT unit 74 performs spatial inverse FFT processing, that is, the inverse transform of the spatial FFT processing, on the spatial frequency domain speaker signals supplied from the filter processing unit 73, and supplies the resulting time-frequency domain speaker signals of each channel to the time inverse FFT units 75-1 to 75-6.
  • The time inverse FFT units 75-1 to 75-6 perform time inverse FFT processing, that is, the inverse transform of the time FFT processing, on the time-frequency domain speaker signals supplied from the spatial inverse FFT unit 74, and supply the resulting time domain speaker signals of each channel to the speakers 13-1 to 13-6.
  • Hereinafter, when it is not necessary to distinguish the time inverse FFT units 75-1 to 75-6, they are also simply referred to as time inverse FFT units 75.
  • Such a spatial frequency domain processing system can be used when the microphones 11 and the speakers 13 are arranged in an array with a specific shape such as an annular shape, and can realize spatial NC with a large number of channels over a wide frequency band with a small amount of computation.
  • That is, the correlation between channels is taken into consideration, so higher spatial NC performance can be obtained up to higher frequencies than in the parallel SISO system, and moreover the amount of computation can be kept lower than in the multipoint control MIMO system.
  • a wide region can be controlled, that is, a desired wavefront can be formed with high accuracy in a wide region, so that spatial NC can be performed for a wide region.
  • Furthermore, the adaptive processing of the filter used in the filter processing unit 73 converges quickly, so spatial NC that follows environmental changes can be performed.
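  • As an illustration only (not code from the patent), one STFT frame of the spatial frequency domain processing system of FIG. 3 can be sketched as follows; W stands for an assumed pre-designed filter W'[l, k], and windowing and overlap-add details of the STFT are omitted.

```python
import numpy as np

def spatial_freq_domain_frame(mic_frame, W):
    """mic_frame: (M, Nfft) time-domain samples of one frame, one row per microphone.
    W: (M, Nfft // 2 + 1) complex spatial-frequency-domain filter W'[l, k]."""
    X = np.fft.rfft(mic_frame, axis=1)    # time FFT units 71: X[m, k]
    Xp = np.fft.fft(X, axis=0)            # spatial FFT unit 72: X'[l, k]
    Yp = W * Xp                           # filter processing unit 73: per-bin multiply
    Y = np.fft.ifft(Yp, axis=0)           # spatial inverse FFT unit 74: Y[m, k]
    return np.fft.irfft(Y, n=mic_frame.shape[1], axis=1)  # time inverse FFT units 75
```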
  • Consider a microphone array consisting of a plurality of microphones arranged at equal intervals in a ring centered on the origin of an xy coordinate system, which is a predetermined two-dimensional Cartesian coordinate system set in space, and a speaker array consisting of a plurality of speakers likewise arranged at equal intervals in a ring centered on the origin of the xy coordinate system.
  • Here, the number of elements of the microphone array and of the speaker array, that is, the number of microphones constituting the microphone array and the number of speakers constituting the speaker array, are each M.
  • The radius of the microphone array, that is, the distance from the center position (origin) of the microphone array to each microphone, is R_mic, and the radius of the speaker array is R_spc.
  • The coordinates indicating the arrangement position of each speaker constituting the speaker array can be expressed in the same manner as the coordinates indicating the arrangement position of each microphone.
  • Let n be the discrete time index (time index), let x[m, n] be the time domain microphone signal obtained by the m-th microphone constituting the microphone array, and let y[m, n] be the time domain speaker signal of the m-th speaker constituting the speaker array.
  • Here, k in the microphone signal X[m, k] obtained by the temporal Fourier transform is an index indicating the temporal frequency, and N in equation (2) indicates the temporal Fourier transform length.
  • The spatial Fourier transform is defined in the same way as the temporal Fourier transform. However, while the temporal Fourier transform is performed with respect to the time index n, the spatial Fourier transform is performed with respect to the microphone index m.
  • Let l be the index indicating the spatial frequency, that is, the index of the spatial frequency bin, and let X'[l, k] be the spatial frequency domain signal obtained by performing the spatial Fourier transform on the microphone signal X[m, k]. The relationship between the microphone signal X[m, k] and the signal X'[l, k] is shown in the following equation (3).
  • Similarly, let Y'[l, k] be the spatial frequency domain speaker signal obtained by performing the spatial Fourier transform on the speaker signal Y[m, k]; the relationship between Y[m, k] and Y'[l, k] can also be expressed by an equation of the same form as equation (3).
  • Filtering in the spatial frequency domain is performed on the spatial frequency domain signal X'[l, k]. If the spatial frequency domain filter of this filtering process is W'[l, k], the filtering is expressed by the following equation (4); that is, the speaker signal Y'[l, k] can be obtained by calculating equation (4) from the signal X'[l, k] and the filter W'[l, k].
  • the filtering represented by the equation (4) is performed to generate a speaker signal in the spatial frequency domain.
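  • The numbered equations themselves appear only as images in the source and are not reproduced in this text. Based on the surrounding definitions and the standard DFT convention, equations (3) and (4) presumably take the following form (a hedged reconstruction; the normalization constant is an assumption).

```latex
% Hedged reconstruction of equations (3) and (4) from the surrounding text.
X'[l, k] = \sum_{m=0}^{M-1} X[m, k]\, e^{-j 2\pi l m / M}   \tag{3}

Y'[l, k] = W'[l, k]\, X'[l, k]                              \tag{4}
```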
  • In equation (5), w[m', n'] represents the time domain filter for the microphone with microphone index m' corresponding to the filter W'[l, k], and (P)_Q denotes P mod Q.
  • Equation (5) expresses the relationship among the time domain filter w[m, n], the microphone signal x[m, n], and the time domain speaker signal y[m, n].
  • Filtering based on equation (5) can be used for spatial NC because it is not affected by the system delay caused by the temporal Fourier transform, but its amount of computation is the same as that of the above-mentioned multipoint control MIMO system, so a large amount of computation is required.
  • The multipoint control MIMO system is described in detail in, for example, "C. Hansen, et al., 'Active Control of Noise and Vibration', CRC Press, 2012" (hereinafter referred to as Reference 2).
  • Now, let x'[l, n] be the signal obtained by performing, on the time domain microphone signal x[m, n], the spatial Fourier transform with the total number M of microphones as the DFT point length, that is, the conversion into the spatial frequency domain (spatial frequency conversion).
  • In equation (6), the microphone signal x[m, n] is converted into the frequency domain only in the spatial direction; therefore, the signal x'[l, n] obtained by equation (6) can be said to be a time signal in the spatial frequency domain.
  • Similarly, the time domain filter w[m, n] is not subjected to the temporal Fourier transform (temporal frequency conversion) but only to the spatial Fourier transform (spatial frequency conversion), and the resulting spatial frequency domain filter (the filter of each spatial frequency bin) is defined as w'[l, n].
  • Likewise, the spatial frequency domain speaker signal (the speaker signal of each spatial frequency bin) obtained by performing only the spatial Fourier transform (spatial frequency conversion), without performing the temporal Fourier transform (temporal frequency conversion), on the time domain speaker signal y[m, n] is defined as y'[l, n]. More precisely, the speaker signal y'[l, n] is not by itself a speaker drive signal that drives a speaker; rather, the noise canceling sound output from each speaker is calculated from this speaker signal y'[l, n].
  • The following equation (7) can be obtained by performing only the inverse temporal Fourier transform, without performing the inverse spatial Fourier transform, on both sides of equation (4).
  • That is, the speaker signal y'[l, n] can be obtained by filtering the signal x'[l, n] with the filter w'[l, n] having filter length N_f, that is, by convolving the filter w'[l, n] and the signal x'[l, n] in the time direction.
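  • As with the other numbered equations, (6) and (7) appear only as images in the source; based on the definitions above, they presumably read as follows (hedged reconstruction; sign and normalization conventions are assumptions).

```latex
% Hedged reconstruction of equations (6) and (7) from the surrounding text.
x'[l, n] = \sum_{m=0}^{M-1} x[m, n]\, e^{-j 2\pi l m / M}            \tag{6}

y'[l, n] = \sum_{n'=0}^{N_f - 1} w'[l, n']\, x'[l, n - n']           \tag{7}
```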
  • The filtering operation shown in equation (7) is a process in the spatial frequency domain, but it does not require spatial convolution; that is, it only requires convolution in the temporal direction, performed independently for each spatial frequency index (frequency bin) l.
  • Therefore, the actual amount of computation for obtaining the speaker signal y'[l, n] is the same as that of the above-mentioned parallel SISO system, apart from the additional constant-factor operations.
  • Hereinafter, a system that generates a speaker signal as a noise canceling signal by the spatial frequency conversion shown in equation (6) and the filtering shown in equation (7) is also referred to as a low-delay spatial frequency domain processing system.
  • In the low-delay spatial frequency domain processing system, the filtering operation is a convolution rather than the simple multiplication used in the spatial frequency domain processing system, but since SISO filtering is sufficient, the amount of computation is significantly reduced compared with the multipoint control MIMO system.
  • In addition, since no temporal Fourier transform is performed, the delay time (delay) generated in the system is extremely small; the spatial FFT (spatial frequency conversion) itself does not introduce a time delay.
  • As a result, the low-delay spatial frequency domain processing system is a practical system with low delay and a small amount of computation, and can realize higher-performance spatial NC.
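  • A minimal sketch of the low-delay processing described by equations (6) and (7), written in Python/NumPy under the assumption that the per-bin filters w'[l, n] have been designed offline; this illustrates the computation only and is not the patent's implementation. Because the spatial DFT acts across the microphone index at every time sample, no block of samples has to be buffered for a temporal Fourier transform, which is where the delay reduction comes from.

```python
import numpy as np

def low_delay_spatial_nc(x, w_sf):
    """x: (M, N) time-domain microphone signals x[m, n].
    w_sf: (M, Nf) complex per-bin filters w'[l, n] (spatial DFT of a real
    time-domain filter, assumed to be prepared in advance)."""
    M, N = x.shape
    x_sf = np.fft.fft(x, axis=0)                     # eq. (6): spatial DFT only, no time FFT
    # eq. (7): independent convolution in the time direction for every spatial bin l
    y_sf = np.stack([np.convolve(x_sf[l], w_sf[l])[:N] for l in range(M)])
    return np.fft.ifft(y_sf, axis=0).real            # inverse spatial DFT -> y[m, n]

# Hypothetical usage for a 16-channel setup (random placeholders for real data):
M, N, Nf = 16, 4800, 256
mic = np.random.randn(M, N)                          # microphone signals x[m, n]
w_time = np.random.randn(M, Nf) * 0.01               # some time-domain filters w[m, n]
spk = low_delay_spatial_nc(mic, np.fft.fft(w_time, axis=0))
```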
  • FIG. 4 is a diagram showing a configuration example of a noise canceling device which is an example of an embodiment of a low delay spatial frequency domain processing system to which the present technology is applied.
  • the noise canceling device 101 shown in FIG. 4 has a microphone array 111, a signal processing device 112, and a speaker array 113.
  • the microphone array 111 is a microphone array such as an annular microphone array obtained by arranging the microphones 121-1 to 121-16 in a predetermined shape such as an annular shape or a rectangular shape.
  • the microphones 121-1 to 121-16 collect ambient sounds including noise to be canceled, and supply the resulting microphone signal to the signal processing device 112.
  • Hereinafter, when it is not necessary to distinguish the microphones 121-1 to 121-16, they are also simply referred to as microphones 121.
  • The signal processing device 112 comprises, for example, a personal computer having one or more arithmetic units; it generates time domain speaker signals for spatial NC based on the microphone signals supplied from the microphone array 111 and outputs them to the speaker array 113.
  • the speaker signal in this time domain is a noise canceling signal for spatial NC, and is a speaker driving signal that drives the speakers constituting the speaker array 113 to output a noise canceling sound.
  • the signal processing device 112 has a signal processing unit 131 including one arithmetic unit such as a DSP (Digital Signal Processor) or an FPGA (Field Programmable Gate Array).
  • the signal processing unit 131 has a spatial frequency conversion unit 141, a filter processing unit 142-1 to a filter processing unit 142-16, and a spatial frequency synthesis unit 143.
  • The spatial frequency conversion unit 141 performs spatial frequency conversion on the time domain microphone signals (time signals) supplied from the microphones 121-1 to 121-16, and supplies the resulting spatial frequency domain signals to the filter processing units 142-1 to 142-16. In other words, the spatial frequency conversion unit 141 converts the time domain microphone signals into the spatial frequency domain.
  • Specifically, the DFT shown in equation (6) is performed as the spatial frequency conversion based on the microphone signals supplied from all the microphones 121.
  • In this example, the total number M of microphones 121 is 16, and the signal x'[l, n] is calculated for each of the 16 spatial frequency bins l corresponding to the filter processing units 142-1 to 142-16.
  • The filter processing units 142-1 to 142-16 generate speaker signals in the spatial frequency domain by performing signal processing in the spatial frequency domain on the signals supplied from the spatial frequency conversion unit 141, and supply them to the spatial frequency synthesis unit 143.
  • That is, the filter processing units 142-1 to 142-16 each hold a SISO filter for spatial NC, and filtering of the spatial frequency domain signal from the spatial frequency conversion unit 141 by that SISO filter is performed as the signal processing in the spatial frequency domain. More specifically, the process of convolving the filter coefficients constituting the SISO filter with the spatial frequency domain signal is performed as the filtering by the SISO filter.
  • Specifically, the filter processing units 142-1 to 142-16 hold the above-mentioned filter w'[l, n] as the SISO filter, perform the calculation represented by equation (7) as the filtering, and generate the speaker signal y'[l, n].
  • Hereinafter, when it is not necessary to distinguish the filter processing units 142-1 to 142-16, they are also simply referred to as filter processing units 142.
  • In other words, one filter processing unit 142 performs the filtering for one spatial frequency bin l.
  • The SISO filter held by the filter processing unit 142 is, for example, an FIR (Finite Impulse Response) filter generated in advance by LMS (Least Mean Squares) or the like based on the shape of the microphone array 111, the total number of microphones 121, and so on.
  • A SISO filter prepared in advance may be used continuously, or the SISO filter may be updated sequentially based on a microphone signal obtained by sound collection with a microphone or the like installed at a control point.
  • The spatial frequency synthesis unit 143 generates a time domain speaker signal for each speaker by performing spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 142, and supplies the speaker signals to the speaker array 113.
  • Here, the inverse of the spatial frequency conversion performed by the spatial frequency conversion unit 141 is performed as the spatial frequency synthesis. Therefore, for example, when the DFT (spatial Fourier transform) shown in equation (6) is performed by the spatial frequency conversion unit 141, the spatial frequency synthesis unit 143 performs the IDFT (Inverse Discrete Fourier Transform), that is, the inverse spatial Fourier transform, corresponding to equation (6).
  • The speaker array 113 is a speaker array, such as an annular speaker array, obtained by arranging the speakers 151-1 to 151-16, which are speaker units, in a predetermined shape such as an annular shape or a rectangular shape.
  • Speakers 151-1 to 151-16 are driven based on the speaker signal in the time domain supplied from the spatial frequency synthesis unit 143, and output a noise canceling sound. As a result, the noise sound is canceled in the predetermined target area, and the spatial NC is realized.
  • Hereinafter, when it is not necessary to distinguish the speakers 151-1 to 151-16, they are also simply referred to as speakers 151.
  • In FIG. 5, the parts corresponding to those in FIG. 4 are designated by the same reference numerals, and the description thereof will be omitted. Further, in FIG. 5, in order to make the figure easier to see, reference numerals are not attached to every microphone 121 and speaker 151.
  • the user U11 who is a listener or the like listening to the content is in the predetermined area R11, and this area R11 is the area (cancellation area) targeted by the spatial NC.
  • the speakers 151 constituting the speaker array 113 are arranged in a ring shape so as to surround the area R11 which is a cancellation area to form an annular speaker array.
  • the microphones 121 constituting the microphone array 111 are arranged in a ring shape on the outside of the speaker array 113 so as to surround the speaker array 113 to form a ring-shaped microphone array.
  • the speaker array 113 and the microphone array 111 are arranged so that the center of the speaker array 113 and the microphone array 111 is at the center position of the circular region R11.
  • In the microphone array 111, which is arranged outside the speaker array 113 as viewed from the region R11, noise (noise sound) generated outside the microphone array 111 and propagating toward the region R11 is collected.
  • a speaker signal is generated based on the microphone signal obtained by collecting the sound, and the noise canceling sound based on the speaker signal is output in the direction of the region R11 from each speaker 151 constituting the speaker array 113.
  • The wavefronts of the noise canceling sounds output from the speakers 151 are combined to form a wavefront that cancels the noise sound in the region R11, and spatial NC by wave field synthesis is thereby realized.
  • Note that the number of microphones 121 and the number of speakers 151, and the shapes of the microphone array 111 and the speaker array 113, do not necessarily have to be the same, and may be different numbers and shapes.
  • In that case, the spatial frequency conversion unit 141 or the spatial frequency synthesis unit 143 may perform upsampling or downsampling on the signal in the spatial frequency domain according to those numbers.
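  • One possible way of realizing such up- or down-sampling, offered here only as an assumption since the patent does not specify the method, is to resize the spatial spectrum itself by zero-padding or truncating spatial frequency bins so that the synthesis can use a different point length than the analysis.

```python
import numpy as np

def resize_spatial_spectrum(x_sf, m_out):
    """x_sf: (m_in, N) spatial-frequency-domain signal; returns (m_out, N) so that the
    inverse spatial DFT yields m_out channels (zero-pad to upsample, truncate to downsample)."""
    m_in = x_sf.shape[0]
    centered = np.fft.fftshift(x_sf, axes=0)          # low spatial frequencies in the middle
    if m_out >= m_in:
        pad = m_out - m_in
        centered = np.pad(centered, ((pad // 2, pad - pad // 2), (0, 0)))
    else:
        cut = m_in - m_out
        centered = centered[cut // 2:cut // 2 + m_out]
    return np.fft.ifftshift(centered, axes=0) * (m_out / m_in)
```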
  • the number of microphones 121 and speakers 151 may be any number, and the shape (array shape) of the microphone array 111 and speaker array 113 may be any shape.
  • In step S11, each microphone 121 of the microphone array 111 collects ambient sound and supplies the resulting time domain microphone signal to the spatial frequency conversion unit 141.
  • In step S12, the spatial frequency conversion unit 141 performs spatial frequency conversion on the time domain microphone signals supplied from the microphones 121, and supplies the resulting spatial frequency domain signals to the filter processing units 142. For example, in step S12, the calculation of the above equation (6) is performed and signals in the spatial frequency domain are generated.
  • In step S13, the filter processing units 142 filter the spatial frequency domain signals supplied from the spatial frequency conversion unit 141 with the SISO filters they hold, and supply the resulting spatial frequency domain speaker signals to the spatial frequency synthesis unit 143. For example, in step S13, the calculation of equation (7) is performed as the filtering.
  • In step S14, the spatial frequency synthesis unit 143 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 142, and generates time domain speaker signals.
  • In step S15, the spatial frequency synthesis unit 143 supplies the speaker signals obtained in the process of step S14 to the speakers 151 of the speaker array 113 to output sound (noise canceling sound).
  • As described above, the noise canceling device 101 performs spatial frequency conversion on the time domain microphone signals without performing time-frequency conversion, and generates the speaker signals based on the resulting signals in the spatial frequency domain.
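  • The processing of steps S11 to S15 of FIG. 6 can be pictured as the block-wise loop below (a sketch only; read_mic_block and play_speaker_block are hypothetical stand-ins for real audio I/O, and the filter state carried across blocks is ignored for brevity).

```python
import numpy as np

def noise_cancel_loop(w_sf, read_mic_block, play_speaker_block, block_len=256):
    """w_sf: (M, Nf) complex per-bin filters w'[l, n]; the two callables are
    hypothetical audio input/output helpers, not APIs named in the patent."""
    while True:
        x = read_mic_block(block_len)                     # S11: sound collection, shape (M, block_len)
        x_sf = np.fft.fft(x, axis=0)                      # S12: spatial frequency conversion
        y_sf = np.stack([np.convolve(x_sf[l], w_sf[l])[:block_len]
                         for l in range(x.shape[0])])     # S13: SISO filtering, eq. (7)
        y = np.fft.ifft(y_sf, axis=0).real                # S14: spatial frequency synthesis
        play_speaker_block(y)                             # S15: output noise canceling sound
```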
  • <Second embodiment> <Processing in the spatial frequency domain>
  • In the noise canceling device 101 described above, the spatial frequency conversion unit 141 first performs spatial frequency conversion, and the time domain microphone signals are converted into signals in the spatial frequency domain.
  • Therefore, the outputs (microphone signals) of all the microphones 121 constituting the microphone array 111 must be input to the signal processing unit 131 for the spatial frequency conversion (DFT). Furthermore, after the filtering in the spatial frequency domain, spatial frequency synthesis must be performed using the outputs of all the filter processing units 142.
  • As a result, one arithmetic unit serving as the signal processing unit 131 must perform the spatial frequency conversion, the signal processing (filtering) in the spatial frequency domain, and the spatial frequency synthesis. That is, it is not possible to divide the hardware that performs these processes and share (distribute) the processes among a plurality of pieces of hardware (arithmetic units).
  • For example, suppose the upper limit of the frequency targeted for noise canceling is set to 1 kHz and the cancel area (region R11) is a region with a diameter of 2 m; a correspondingly large number of microphones 121 and speakers 151, and thus of channels, is then required in order to obtain sufficient spatial NC performance.
  • In such a case, realizing the signal processing unit 131 with a single arithmetic unit such as one DSP or FPGA may be physically impossible because of the number of PINs (the number of input terminals and output terminals) provided on the arithmetic unit.
  • In addition, one signal processing unit 131 may not be able to perform the computation (processing) because the amount of computation is too large, and it may therefore not be possible to realize spatial NC.
  • Therefore, the plurality of microphones 121 and speakers 151 constituting the microphone array 111 and the speaker array 113 may be divided into a plurality of groups, and the spatial frequency conversion, the filtering in the spatial frequency domain, and the spatial frequency synthesis may be performed for each of the divided groups.
  • In this way, the computation for spatial NC can be shared by a plurality of arithmetic units, so high spatial NC performance can be obtained while reducing the number of PINs and the amount of computation required for one arithmetic unit.
  • As shown in FIG. 7, consider a case where a noise sound is generated with the position P11 outside the microphone array 111 as the sound source position of a single point sound source.
  • the same reference numerals are given to the portions corresponding to those in FIG. 5, and the description thereof will be omitted as appropriate.
  • all the microphones 121 constituting the microphone array 111 and all the speakers 151 constituting the speaker array 113 are used in order to perform spatial NC.
  • However, the degree to which each microphone 121 and speaker 151 contributes to the sound collection and sound output, that is, its importance (contribution rate) for the realization of spatial NC, differs.
  • Specifically, the microphones 121 and speakers 151 arranged at positions close to the position P11 where the noise source is located are highly important, and conversely, the microphones 121 and speakers 151 arranged at positions far from the position P11 are less important.
  • Therefore, as shown in FIG. 8, it is also possible to perform spatial NC using only the four microphones 121 and twelve speakers 151 near the position P11 where the noise source is located.
  • the parts corresponding to the case in FIG. 7 are designated by the same reference numerals, and the description thereof will be omitted as appropriate.
  • In the example of FIG. 8, the number of microphones 121 used for spatial NC is four, while the number of speakers 151 used is twelve. Therefore, in order to generate the speaker signals for spatial NC by the same calculation as in the noise canceling device 101, the microphone signals of eight additional microphones 121 are nominally required. If, for example, zero signals are used in place of those eight microphone signals, the speaker signals of each of the twelve speakers 151 can be generated in the same manner as in the noise canceling device 101.
  • the noise sound generated at the position P11 can be sufficiently canceled without using all the microphones 121 and the speakers 151.
  • However, in this case the noise generation position (noise source position) must be near the position P11, and noise sounds from all directions cannot be dealt with.
  • Therefore, if the microphones 121 constituting the microphone array 111 and the speakers 151 constituting the speaker array 113 are each divided into four groups and speaker signals are generated for each group, it becomes possible to deal with noise sounds from all directions.
  • the parts corresponding to the case in FIG. 7 are designated by the same reference numerals, and the description thereof will be omitted as appropriate.
  • the 16 microphones 121 constituting the microphone array 111 and the 16 speakers 151 constituting the speaker array 113 are each divided into four groups.
  • Hereinafter, a group of microphones 121 will also be referred to as a microphone group, and a group of speakers 151 will also be referred to as a speaker group.
  • the 16 microphones 121 constituting the microphone array 111 are divided into four microphone groups as shown by arrows Q21 to Q24.
  • Here, the grouping is performed so that each microphone 121 belongs to only one microphone group and microphones 121 arranged adjacent to each other belong to the same microphone group.
  • one microphone group consists of four microphones 121.
  • For example, one microphone group is formed by the four microphones 121 arranged adjacent to each other on the right front side as viewed from the user U11.
  • Similarly, microphone groups are formed by the four microphones 121 arranged adjacent to each other in each of the right rear, left rear, and left front directions as viewed from the user U11.
  • For the microphone group on the right front side as viewed from the user U11, a speaker group consisting of the twelve speakers 151 arranged adjacent to each other and centered on the position to the right front of the user U11 is formed.
  • Similarly, for the microphone groups in the right rear, left rear, and left front directions as viewed from the user U11, speaker groups each consisting of twelve speakers 151 arranged adjacent to each other and centered on the position in the corresponding direction are formed.
  • Since the speaker groups are formed so that twelve speakers 151 arranged adjacent to each other belong to one speaker group, one speaker 151 belongs to three speaker groups.
  • By generating speaker signals for each such pair of microphone group and speaker group, the entire processing for spatial NC can be divided into four. That is, for example, the hardware is divided into four by providing one arithmetic unit corresponding to the signal processing unit 131 for each corresponding pair of microphone group and speaker group, and the processing for spatial NC can be distributed among the plurality of arithmetic units.
  • In other words, the plurality of microphones 121 constituting the microphone array 111 are divided into four microphone groups so that four mutually adjacent microphones 121 belong to the same microphone group; that is, the microphones 121 used for one filtering operation are selected while shifting by four microphones at a time.
  • the speaker 151 that is the output destination of the speaker signal obtained by one filtering is selected, and the speaker signal is supplied to the selected speaker 151.
  • In this way, grouping is performed so as to divide the entire processing into parts, and where the filtering output destinations overlap, the speaker signals are added to obtain the final speaker signals. As a result, spatial NC can be performed using all the microphones 121 and speakers 151, which makes it possible to deal with noise sounds from all directions.
  • the noise canceling device is configured as shown in FIG. 10, for example.
  • In FIG. 10, the same reference numerals are given to the portions corresponding to those in FIG. 4, and the description thereof will be omitted as appropriate.
  • the noise canceling device 191 shown in FIG. 10 has a microphone array 111, a signal processing device 201, and a speaker array 113.
  • the microphone array 111 and the speaker array 113 are each divided into four groups.
  • Specifically, the microphones 121-1 to 121-4 form one group, and similarly, the microphones 121-5 to 121-8, the microphones 121-9 to 121-12, and the microphones 121-13 to 121-16 each form one group.
  • Also, the speakers 151-1 to 151-12 form one group, the speakers 151-5 to 151-16 form one group, the speakers 151-9 to 151-16 together with the speakers 151-1 to 151-4 form one group, and the speakers 151-13 to 151-16 together with the speakers 151-1 to 151-8 form one group.
  • the signal processing device 201 corresponds to the signal processing device 112 of FIG. 4, and is composed of, for example, a personal computer having one or a plurality of arithmetic units.
  • the signal processing device 201 has a signal processing unit 211-1 to a signal processing unit 211-4, and an addition unit 212-1 to an addition unit 212-16.
  • Each of the signal processing unit 211-1 to the signal processing unit 211-4 is composed of one arithmetic unit such as a DSP or FPGA, and corresponds to the signal processing unit 131 of FIG.
  • The signal processing unit 211-1 performs the same processing as the signal processing unit 131 based on the microphone signals supplied from the microphones 121-1 to 121-4 and eight predetermined zero signals treated as microphone signals, and generates speaker signals.
  • That is, the signal processing unit 211-1 generates speaker signals for twelve channels, namely speaker signals whose output destinations are the speakers 151-13 to 151-16 and the speakers 151-1 to 151-8.
  • The signal processing unit 211-1 supplies the generated speaker signals to the addition units of the corresponding channels, that is, the addition units 212-13 to 212-16 and the addition units 212-1 to 212-8.
  • Similarly, the signal processing units 211-2 to 211-4 also generate and output speaker signals for twelve channels based on the microphone signals from four microphones 121 and eight zero signals.
  • That is, the signal processing unit 211-2 receives the microphone signals from the microphones 121-5 to 121-8 and supplies speaker signals to the addition units 212-1 to 212-12.
  • The signal processing unit 211-3 receives the microphone signals from the microphones 121-9 to 121-12 and supplies speaker signals to the addition units 212-5 to 212-16.
  • The signal processing unit 211-4 receives the microphone signals from the microphones 121-13 to 121-16 and supplies speaker signals to the addition units 212-9 to 212-16 and the addition units 212-1 to 212-4.
  • Hereinafter, when it is not necessary to distinguish the signal processing units 211-1 to 211-4, they are also simply referred to as signal processing units 211.
  • In the noise canceling device 191, the signal processing unit 211 to which the microphone signals obtained by sound collection by the microphones 121 of a microphone group are input is predetermined for each microphone group.
  • Each signal processing unit 211 performs filtering by SISO filters based on the microphone signals supplied from all the microphones 121 belonging to one microphone group, and generates the speaker signals of some of the speakers 151 of the speaker array 113, that is, of the speakers 151 belonging to the speaker group corresponding to that microphone group.
  • The addition units 212-1 to 212-16 add the speaker signals of the same channel supplied from a plurality of signal processing units 211 to obtain the final speaker signal, and supply the final speaker signal to the speaker 151 of the corresponding channel.
  • Specifically, the addition units 212-1 to 212-4 receive speaker signals from the signal processing units 211-1, 211-2, and 211-4, and the addition units 212-5 to 212-8 receive speaker signals from the signal processing units 211-1 to 211-3.
  • The addition units 212-9 to 212-12 receive speaker signals from the signal processing units 211-2 to 211-4, and the addition units 212-13 to 212-16 receive speaker signals from the signal processing units 211-1, 211-3, and 211-4.
  • Hereinafter, when it is not necessary to distinguish the addition units 212-1 to 212-16, they are also simply referred to as addition units 212.
  • In other words, one corresponding addition unit 212 is provided for each of the plurality of speakers 151 constituting the speaker array 113, and each addition unit 212 adds and outputs the speaker signals for the same speaker 151 obtained by two or more signal processing units 211.
  • Note that the signal processing units 211 may each be provided in a plurality of signal processing devices different from each other.
  • FIG. 11 is a diagram showing a configuration example of the signal processing unit 211 of the noise canceling device 191.
  • the signal processing unit 211 has a spatial frequency conversion unit 241, a filter processing unit 242-1 to a filter processing unit 242-12, and a spatial frequency synthesis unit 243.
  • The spatial frequency conversion unit 241, the filter processing units 242-1 to 242-12, and the spatial frequency synthesis unit 243 correspond to the spatial frequency conversion unit 141, the filter processing units 142, and the spatial frequency synthesis unit 143 shown in FIG. 4, respectively.
  • the spatial frequency conversion unit 241 performs spatial frequency conversion based on the time domain microphone signals supplied from each of the four microphones 121 and the eight zero signals supplied as dummy microphone signals.
  • Here, the same DFT as in equation (6) is performed as the spatial frequency conversion, with the DFT point length set to 12, and the signal x'[l, n] is calculated for each of the 12 spatial frequency bins l corresponding to the filter processing units 242-1 to 242-12.
  • the spatial frequency conversion unit 241 supplies the signal in the spatial frequency domain obtained by the spatial frequency conversion to the filter processing unit 242-1 to the filter processing unit 242-12.
  • The filter processing units 242-1 to 242-12 generate speaker signals in the spatial frequency domain by performing signal processing in the spatial frequency domain on the signals supplied from the spatial frequency conversion unit 241, and supply them to the spatial frequency synthesis unit 243.
  • the filter processing unit 242-1 to the filter processing unit 242-12 hold an SISO filter for spatial NC.
  • The filter processing units 242-1 to 242-12 generate speaker signals by filtering the spatial frequency domain signals from the spatial frequency conversion unit 241 with the SISO filters they hold as the signal processing. This SISO filter is, for example, the above-mentioned filter w'[l, n], and the calculation of equation (7) is performed as the filtering.
  • Hereinafter, when it is not necessary to distinguish the filter processing units 242-1 to 242-12, they are also simply referred to as filter processing units 242.
  • The spatial frequency synthesis unit 243 generates a time domain speaker signal for each speaker 151 by performing spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 242, and supplies them toward the speaker array 113.
  • the inverse conversion of the spatial frequency conversion performed by the spatial frequency conversion unit 241 is performed as the spatial frequency synthesis.
  • In the noise canceling device 191, the outputs of the plurality of microphones 121 constituting the microphone array 111 are divided into four sets and input to the respective signal processing units 211.
  • In each signal processing unit 211, the microphone signals supplied from the four microphones 121 are input to the four input terminals at the center of the twelve input terminals, and a zero signal, which is a dummy microphone signal, is input to each of the remaining eight input terminals, four at each of the left and right ends.
  • Then, a DFT with a DFT point length of 12, for example, is performed as the spatial frequency conversion based on the signals input from the input terminals, and in each filter processing unit 242 the DFT output is filtered by the SISO filter.
  • IDFT is performed as spatial frequency synthesis for the output of each filter processing unit 242, and a speaker signal in the time domain of each channel is generated.
  • The speaker signal of each channel generated in this way is input to the speaker 151 corresponding to that channel, but before that, the speaker signals of the same channel from three mutually adjacent signal processing units 211 are added in the corresponding addition unit 212.
  • Note that the addition processing of the speaker signals of the same channel may be performed in the amplifier or before the speaker signals are input to the amplifier, and the addition processing may be performed on digital or analog signals.
  • In each signal processing unit 211, the number of inputs and outputs of the spatial frequency conversion unit 241 and the spatial frequency synthesis unit 243, that is, the point length of the DFT and IDFT, is 12, which is smaller than the point length of 16 in the case of the signal processing unit 131 shown in FIG. 4.
  • Therefore, the number of PINs (input/output terminals) of the signal processing unit 211 can be reduced compared with the signal processing unit 131, and the amount of computation (signal processing) performed by the signal processing unit 211 can also be reduced.
  • With the noise canceling device 191, it is possible to reduce the number of PINs and the amount of computation of each signal processing unit 211 while also reducing the delay time, and to obtain high spatial NC performance in real time. Moreover, noise sounds from all directions can be dealt with.
  • The point length (number of divisions) in each signal processing unit can be set arbitrarily according to the specifications of the signal processing unit (arithmetic unit), such as the number of PINs and the number of MIPS (Million Instructions Per Second); for example, a 256-channel microphone signal may be divided into units of 12 channels for processing.
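  • A hedged sketch of the grouped processing of FIGS. 10 and 11: each of the four units processes four real microphone signals placed at the center of a 12-input block together with eight zero signals, and the twelve resulting channels are accumulated into the overlapping speaker channels by the addition units. The index mapping between the twelve output terminals and the sixteen speakers is an assumption made for illustration, not taken from the patent figures.

```python
import numpy as np

def grouped_spatial_nc(mic_signals, group_filters):
    """mic_signals: (16, N) time-domain signals from the microphone array 111.
    group_filters: (4, 12, Nf) complex per-bin filters, one set per signal processing
    unit 211, assumed to come from real time-domain filters so the spatial IDFT is real."""
    M, N = mic_signals.shape                        # 16 microphones and 16 speakers
    speakers = np.zeros((M, N))
    for g in range(4):                              # one iteration = one signal processing unit 211
        block = np.zeros((12, N))
        block[4:8] = mic_signals[4 * g:4 * g + 4]   # 4 mic signals centred, 8 zero signals
        x_sf = np.fft.fft(block, axis=0)            # 12-point spatial DFT (unit 241)
        y_sf = np.stack([np.convolve(x_sf[l], group_filters[g, l])[:N]
                         for l in range(12)])       # per-bin SISO filtering (units 242)
        y_block = np.fft.ifft(y_sf, axis=0).real    # 12-point spatial IDFT (unit 243)
        for j in range(12):
            spk = (4 * g - 4 + j) % M               # assumed overlapping speaker mapping
            speakers[spk] += y_block[j]             # addition units 212
    return speakers
```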
  • Since the process of step S51 is the same as the process of step S11 of FIG. 6, the description thereof will be omitted; note, however, that in step S51 the microphone signal obtained by each microphone 121 is supplied to the spatial frequency conversion unit 241 of the corresponding signal processing unit 211.
  • In step S52, the spatial frequency conversion unit 241 of each signal processing unit 211 performs spatial frequency conversion on the time domain microphone signals supplied from the four microphones 121 and on the eight zero signals, and supplies the resulting spatial frequency domain signals to the filter processing units 242. For example, in step S52, the same calculation as in the above equation (6) is performed.
  • In step S53, the filter processing units 242 filter the spatial frequency domain signals supplied from the spatial frequency conversion unit 241 with the SISO filters they hold, and supply the resulting spatial frequency domain speaker signals to the spatial frequency synthesis unit 243. For example, in step S53, the same calculation as in equation (7) is performed as the filtering.
  • In step S54, the spatial frequency synthesis unit 243 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 242, and supplies the resulting time domain speaker signals to the addition units 212.
  • In step S55, each addition unit 212 performs addition processing to add the speaker signals of the same channel supplied from the spatial frequency synthesis units 243 of the three corresponding signal processing units 211, and obtains the final speaker signal.
  • Finally, each addition unit 212 supplies the speaker signal obtained in the process of step S55 to the corresponding speaker 151 of the speaker array 113 to output sound (noise canceling sound), and the noise canceling process ends.
  • As described above, the noise canceling device 191 divides the output of the microphone array 111 into four parts, inputs them to the signal processing units 211, and has each signal processing unit 211 generate speaker signals by signal processing in the spatial frequency domain. By doing so, the number of PINs and the amount of computation of one signal processing unit 211 can be reduced, the delay time can also be reduced, and high-performance spatial NC that can handle all directions can be realized in real time.
  • In the noise canceling device 191 described above, the addition units 212 are provided after the signal processing units 211 in order to share the processing among the plurality of signal processing units 211. However, as shown in FIGS. 13 and 14, if the number of microphones 121 belonging to each microphone group is increased and the outputs of the microphones 121 are input, with overlap, to a plurality of adjacent signal processing units 211 (arithmetic units), a configuration without the addition units 212 is also possible.
  • In FIG. 13 and FIG. 14, the parts corresponding to those in FIG. 7 are designated by the same reference numerals, and the description thereof will be omitted as appropriate.
  • In this case, the noise sound from the position P11 is signal-processed, that is, filtered, using twelve microphones 121, and of the speaker signals obtained as a result, the speaker signals of four channels are used and sound is output from the corresponding four speakers 151.
  • the microphone 121 and the speaker 151 are each divided into four groups.
  • For example, one microphone group is formed by the twelve microphones 121 arranged adjacent to each other and centered on the position to the right front of the user U11.
  • Similarly, microphone groups are formed by the twelve microphones 121 arranged adjacent to each other and centered on the positions in the right rear, left rear, and left front directions of the user U11.
  • In this case, one microphone 121 belongs to three microphone groups.
  • a speaker group consisting of four speakers 151 arranged adjacent to each other is formed with respect to the microphone group on the right front side of the user U11, centered on the position on the right side front side of the user U11. Has been done.
  • the right rear, left rear, and left front of the user U11 are also for the microphone groups in the right rear, left rear, and left front directions of the user U11.
  • a speaker group consisting of four speakers 151 arranged adjacent to each other is formed with the position in each direction as the center.
  • the speaker groups are grouped so that one speaker 151 belongs to only one speaker group, and the speakers 151 arranged adjacent to each other belong to one speaker group.
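As a rough sketch of this grouping on a 16-element circular array (the index offsets below are placeholders chosen only to reproduce the overlap pattern described here), overlapping microphone groups of twelve adjacent elements and disjoint speaker groups of four adjacent elements can be built as follows:

```python
# Hypothetical grouping for a 16-microphone / 16-speaker circular array.
N_MIC = N_SPK = 16

def circular_range(start, length, n):
    """Indices of `length` adjacent elements on an n-element ring."""
    return [(start + i) % n for i in range(length)]

# Four overlapping microphone groups of 12 adjacent microphones each;
# consecutive groups are shifted by 4, so each microphone ends up in 3 groups.
mic_groups = [circular_range(4 * g - 4, 12, N_MIC) for g in range(4)]

# Four disjoint speaker groups of 4 adjacent speakers each.
spk_groups = [circular_range(4 * g, 4, N_SPK) for g in range(4)]

# Every microphone belongs to exactly three groups, every speaker to one.
assert all(sum(m in grp for grp in mic_groups) == 3 for m in range(N_MIC))
assert all(sum(s in grp for grp in spk_groups) == 1 for s in range(N_SPK))
```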
<Configuration example of noise canceling device>
In such a case, the noise canceling device is configured as shown in FIG. 15, for example. In FIG. 15, the parts corresponding to those in FIG. 10 are designated by the same reference numerals, and their description will be omitted as appropriate.

The noise canceling device 281 shown in FIG. 15 has a microphone array 111, a signal processing device 201, and a speaker array 113. Further, the signal processing device 201 has signal processing units 211-1 to 211-4.

The configuration of the noise canceling device 281 differs from that of the noise canceling device 191 in that the addition units 212 are not provided, and is otherwise the same as the configuration of the noise canceling device 191. However, the noise canceling device 281 and the noise canceling device 191 differ in the input/output relationship between the signal processing units 211 and the microphones 121 and speakers 151.
That is, in this example, the microphones 121-1 to 121-8 and the microphones 121-13 to 121-16 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-1. Similarly, the microphones 121-1 to 121-12 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-2. The microphones 121-5 to 121-16 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-3. The microphones 121-9 to 121-16 and the microphones 121-1 to 121-4 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-4.

In this way, the output of one microphone 121 is input to two or more, more specifically three, signal processing units 211 predetermined for that microphone 121 (microphone group). Therefore, no dummy microphone signal (zero signal) is supplied to the spatial frequency conversion unit 241 of each signal processing unit 211; instead, the microphone signals of twelve microphones 121 are input to it.

Further, the speakers 151-1 to 151-4 form one group, and the speakers 151-5 to 151-8 also form one group. Similarly, the speakers 151-9 to 151-12 form one group, and the speakers 151-13 to 151-16 form one group.
In each signal processing unit 211, filtering by the SISO filters and the like is performed based on the microphone signals, and speaker signals for some of the speakers 151 of the speaker array 113, that is, for all the speakers 151 belonging to the speaker group corresponding to the microphone group, are generated.

Note that, in the spatial frequency synthesis unit 243, the same number of channels of speaker signals as the number of inputs of the spatial frequency conversion unit 241, that is, 12 channels of speaker signals corresponding to 12 speakers 151, are obtained. However, of these speaker signals, only the speaker signals for four channels, that is, the speaker signals of some of the twelve speakers 151, are actually output to the speakers 151.

That is, a speaker signal is output from each of the four central output terminals of the twelve output terminals to the speaker 151 connected to that terminal. On the other hand, since no speaker 151 is connected to the remaining eight output terminals, four at each of the left and right ends, no speaker signal is supplied from those output terminals to a speaker 151.

In other words, speaker signals are output from only four of the twelve output terminals, and the remaining eight output terminals are not used. Therefore, for example, part of the output of the spatial frequency synthesis (IDFT) may be omitted from the calculation.
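A minimal sketch of that shortcut, under assumed sizes and an assumed IDFT normalization, is shown below: only the rows of the inverse spatial DFT that correspond to the speakers actually connected are evaluated, instead of the full 12-point IDFT. The bin count and the set of used output channels are placeholders.

```python
import numpy as np

M = 12                         # spatial frequency bins / outputs per unit (assumed)
used_channels = [4, 5, 6, 7]   # hypothetical: the four central output terminals
n_samples = 480

# y_spatial[l, n] would come from the per-bin SISO filtering (equation (7)).
y_spatial = np.random.randn(M, n_samples) + 1j * np.random.randn(M, n_samples)

# A full inverse spatial DFT would produce all 12 speaker channels:
#   y[m, n] = (1 / M) * sum_l y'[l, n] * exp(+j * 2 * pi * l * m / M)   (normalization assumed)
# Here only the rows for the speakers that are actually connected are evaluated.
l = np.arange(M)
partial_idft = np.exp(2j * np.pi * np.outer(used_channels, l) / M) / M   # (4, M)
speaker_signals = np.real(partial_idft @ y_spatial)                      # (4, n_samples)
```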
The calculation for spatial NC performed by the noise canceling device 281 is completely equivalent to the calculation for spatial NC performed by the noise canceling device 191. Therefore, either the configuration of the noise canceling device 281 or the configuration of the noise canceling device 191 may be selected; depending on the case, the configuration of the noise canceling device 281 may be adopted, or the configuration of the noise canceling device 191 may be adopted.

In the noise canceling device 281 described above, basically the noise canceling process described with reference to FIG. 6 is performed.
That is, in step S11, the microphone signal obtained by each microphone 121 is supplied to the spatial frequency conversion unit 241 of the signal processing unit 211. In step S12, the spatial frequency conversion unit 241 of each signal processing unit 211 performs spatial frequency conversion, and the resulting signals are supplied to the filter processing units 242-1 to 242-12 of that signal processing unit 211.

In step S13, filtering is performed by the filter processing units 242 of each signal processing unit 211, and the resulting spatial frequency domain speaker signals are supplied to the spatial frequency synthesis unit 243 of that signal processing unit 211. In step S14, spatial frequency synthesis is performed by the spatial frequency synthesis unit 243 of each signal processing unit 211, and in step S15 the resulting time domain speaker signals are supplied to the speakers 151, whereby spatial NC is realized.
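Putting these steps together for the noise canceling device 281, the sketch below (sizes, filters, group indices, and the choice of which four outputs are wired to speakers are all placeholders) has each signal processing unit consume its overlapping twelve-microphone group and drive only the four speakers of its own speaker group, so that no addition unit is needed:

```python
import numpy as np

N_MIC, N_SPK, GROUP, Nf, n_samples = 16, 16, 12, 64, 480
mics = np.random.randn(N_MIC, n_samples)
speaker_signals = np.zeros((N_SPK, n_samples))

for g in range(4):
    mic_idx = [(4 * g - 4 + i) % N_MIC for i in range(GROUP)]   # overlapping group
    spk_idx = [4 * g + i for i in range(4)]                     # disjoint group
    filters = np.random.randn(GROUP, Nf) * 0.01                 # placeholder w'[l, n]

    x_spatial = np.fft.fft(mics[mic_idx], axis=0)               # step S12
    y_spatial = np.stack([np.convolve(x_spatial[l], filters[l])[:n_samples]
                          for l in range(GROUP)])               # step S13
    y_time = np.real(np.fft.ifft(y_spatial, axis=0))            # step S14
    # Step S15: only the four channels assumed to be wired to this unit's speakers are used.
    speaker_signals[spk_idx] = y_time[4:8]
```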
By the way, the series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the programs constituting the software are installed on a computer. Here, the computer includes a computer incorporated in dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions by installing various programs.

FIG. 16 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by a program.
In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to each other by a bus 504.
An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.

The input unit 506 includes a keyboard, a mouse, a microphone, an image pickup element, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a non-volatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory.
In the computer configured as described above, the CPU 501 loads the program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the above-described series of processes is performed.

The program executed by the computer (CPU 501) can be provided by being recorded on the removable recording medium 511 as a package medium or the like, for example. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 in the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in advance in the ROM 502 or the recording unit 508.

The program executed by the computer may be a program in which processing is performed in chronological order in the order described in the present specification, or may be a program in which processing is performed in parallel or at a necessary timing such as when a call is made.
The embodiments of the present technology are not limited to the above-described embodiments, and various changes can be made without departing from the gist of the present technology.

For example, the present technology can take a cloud computing configuration in which one function is shared by a plurality of devices via a network and processed jointly.

Further, each step described in the above flowcharts can be executed by one device or shared among a plurality of devices. Furthermore, when one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared among a plurality of devices.
Further, the present technology can also have the following configurations.

(1) A signal processing device including one or more signal processing units that perform signal processing in the spatial frequency domain, in which the signal processing unit performs the signal processing on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by a plurality of microphones.
(2) The signal processing device according to (1), in which the signal processing unit generates a noise canceling signal by performing the signal processing.
(3) The signal processing device according to (1) or (2), further including: a spatial frequency conversion unit that performs spatial frequency conversion on the plurality of time domain microphone signals; and a spatial frequency synthesis unit that performs spatial frequency synthesis on the signal in the spatial frequency domain obtained by the signal processing.
(4) The signal processing device according to any one of (1) to (3), in which the signal processing unit performs the signal processing with a plurality of signals based on the plurality of microphone signals obtained by the plurality of microphones as inputs, and outputs a plurality of signals.
(5) The signal processing device according to any one of (1) to (4), in which the signal processing unit has a plurality of filter processing units and performs filtering by the filter processing units as the signal processing.
(6) The signal processing device according to any one of (1) to (5), including a plurality of the signal processing units and an addition unit, in which the signal processing unit performs the signal processing on a signal based on the microphone signals obtained by all the microphones belonging to one group when the plurality of microphones are divided into a plurality of groups, and the addition unit adds the speaker signals, for the speaker corresponding to the addition unit, obtained by two or more of the plurality of signal processing units, and outputs the final speaker signal obtained by the addition to the corresponding speaker.
(7) The signal processing device according to (6), in which the plurality of the microphones are divided into a predetermined number of the groups so that microphones adjacent to each other belong to the same group, and the signal processing unit to which the microphone signals obtained by the microphones belonging to a group are input is defined for each of the predetermined number of the groups.
(8) The signal processing device according to any one of (1) to (5), including a plurality of the signal processing units, in which, for each of the plurality of microphones, the microphone signal obtained by one microphone is input to two or more predetermined signal processing units among the plurality of signal processing units, and the signal processing unit performs the signal processing on a signal based on the input microphone signals to generate speaker signals corresponding to each of a plurality of speakers, and outputs the speaker signals to some of the plurality of speakers.
(11) A noise canceling device including: a plurality of microphones; one or more signal processing units that perform signal processing in the spatial frequency domain; and a plurality of speakers that output sound based on a noise canceling signal generated by the signal processing, in which the signal processing unit generates the noise canceling signal by performing the signal processing on a signal converted in the spatial frequency domain based on the microphone signals obtained by sound collection by the plurality of microphones.

Abstract

The present technology pertains to a signal processing device and method, a noise cancelling device, and a program configured to enable a reduction in delay time. The signal processing device includes one or a plurality of signal processing units that carry out signal processing in the spatial frequency domain. The signal processing units carry out the signal processing on signals converted in the spatial frequency domain on the basis of microphone signals obtained by sound collection using a plurality of microphones. This technology can be applied to noise cancelling devices.

Description

Signal processing device and method, noise cancelling device, and program
The present technology relates to a signal processing device and method, a noise canceling device, and a program, and more particularly to a signal processing device and method, a noise canceling device, and a program capable of reducing delay time.

Conventionally, spatial noise canceling (hereinafter also referred to as spatial NC (Noise Cancelling)), which performs noise canceling using wave field synthesis technology, is known.

For example, as one method for performing spatial NC using an annular microphone array and an annular speaker array, a method of performing operations such as filtering in the spatial frequency domain is conceivable.

A technique for performing operations in the spatial frequency domain in order to realize wave field synthesis has been proposed (see, for example, Non-Patent Document 1), and by using operations in the spatial frequency domain, higher spatial NC performance can be realized in consideration of the correlation between the channels corresponding to a plurality of speakers.

By the way, in order to realize spatial NC by performing filtering or the like in the spatial frequency domain, it is necessary to perform a temporal Fourier transform on the microphone signals obtained by all the microphones, perform a spatial Fourier transform on the resulting signals, and thereby convert them into signals in the spatial frequency domain.

Moreover, even after the filtering, it is necessary to perform the inverse of the spatial Fourier transform and the inverse of the temporal Fourier transform on the signals in the spatial frequency domain to return them to signals in the time domain.

However, such a temporal Fourier transform causes an algorithmic delay that is unavoidable in principle, so that a delay occurs before the sound for spatial NC is output from the speakers, and the spatial NC performance deteriorates.

The present technology has been made in view of such a situation, and makes it possible to reduce the delay time.

A signal processing device according to a first aspect of the present technology includes one or more signal processing units that perform signal processing in the spatial frequency domain, and the signal processing unit performs the signal processing on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by a plurality of microphones.

A signal processing method or program according to the first aspect of the present technology is a signal processing method or program for a signal processing device having one or more signal processing units that perform signal processing in the spatial frequency domain, and includes a step in which the one or more signal processing units perform the signal processing on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by a plurality of microphones.

In the first aspect of the present technology, in a signal processing device having one or more signal processing units that perform signal processing in the spatial frequency domain, the signal processing is performed by the one or more signal processing units on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by a plurality of microphones.

A noise canceling device according to a second aspect of the present technology includes a plurality of microphones, one or more signal processing units that perform signal processing in the spatial frequency domain, and a plurality of speakers that output sound based on a noise canceling signal generated by the signal processing, and the signal processing unit generates the noise canceling signal by performing the signal processing on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by the plurality of microphones.

In the second aspect of the present technology, in a noise canceling device including a plurality of microphones, one or more signal processing units that perform signal processing in the spatial frequency domain, and a plurality of speakers that output sound based on a noise canceling signal generated by the signal processing, the noise canceling signal is generated by the signal processing unit performing the signal processing on a signal converted in the spatial frequency domain based on microphone signals obtained by sound collection by the plurality of microphones.
FIG. 1 is a diagram showing the configuration of a parallel SISO system.
FIG. 2 is a diagram showing the configuration of a multipoint control MIMO system.
FIG. 3 is a diagram showing the configuration of a spatial frequency domain processing system.
FIG. 4 is a diagram showing a configuration example of a noise canceling device.
FIG. 5 is a diagram showing an example of the speaker arrangement of a speaker array.
FIG. 6 is a flowchart explaining noise canceling processing.
FIG. 7 is a diagram explaining the use of microphones and speakers.
FIG. 8 is a diagram explaining the use of microphones and speakers.
FIG. 9 is a diagram explaining the grouping of a microphone array and a speaker array.
FIG. 10 is a diagram showing a configuration example of a noise canceling device.
FIG. 11 is a diagram showing a configuration example of a signal processing unit.
FIG. 12 is a flowchart explaining noise canceling processing.
FIG. 13 is a diagram explaining the use of microphones and speakers.
FIG. 14 is a diagram explaining the grouping of a microphone array and a speaker array.
FIG. 15 is a diagram showing a configuration example of a noise canceling device.
FIG. 16 is a diagram showing a configuration example of a computer.
Hereinafter, embodiments to which the present technology is applied will be described with reference to the drawings.
<First Embodiment>
<About the present technology>
The present technology realizes spatial NC that does not require a temporal frequency transform and its inverse transform by directly performing a spatial frequency transform on the time domain microphone signals and converting them into signals in the spatial frequency domain. As a result, the delay time can be reduced, and higher spatial NC performance can be obtained.
In order to realize spatial NC at desired frequencies in a desired region, it is necessary to use as many microphones and speakers as are needed to satisfy the spatial Nyquist theorem.

In particular, in general spatial NC, in order to obtain a cancellation effect up to higher frequencies, it is necessary to use a large number of microphones and speakers and to perform an enormous amount of calculation.

That is, in spatial NC using general spatial frequency domain processing operations, a temporal Fourier transform is performed on the microphone signals obtained by all the microphones, and a spatial Fourier transform is performed on the resulting signals. Then, the signals in the spatial frequency domain obtained by the spatial Fourier transform are filtered to generate speaker signals, after which the inverse spatial Fourier transform and the inverse temporal Fourier transform are performed on all the speaker signals to obtain time domain speaker signals.

As described above, since a temporal Fourier transform causes a time delay that is unavoidable in principle, a system delay occurs, and it is difficult to realize high spatial NC performance in real time.

Further, in spatial NC using general spatial frequency domain processing operations, the spatial Fourier transform, its inverse transform, and the filtering in the spatial frequency domain are performed using signals obtained from the microphone signals of all the microphones.

Therefore, if a large number of microphones and speakers are used for spatial NC, a correspondingly large number of input/output lines to a processing device such as a DSP (Digital Signal Processor) are required, and each process for spatial NC must be handled by a single device.

Accordingly, not only is the computational load of the device performing spatial NC large, but a large number of input/output lines are also required, and a device with low processing capacity or few input/output lines cannot perform spatial NC.

Therefore, in the present technology, the time domain microphone signals are converted directly into signals in the spatial frequency domain, so that spatial NC that does not require a temporal frequency transform or its inverse transform can be realized. As a result, not only can the amount of calculation be significantly reduced, but the delay time can also be reduced, and high spatial NC performance can be obtained in real time.

Further, in the present technology, the microphones and speakers used for spatial NC are divided into a plurality of groups and processing is performed for each group, so that processing that could only be performed by one device in general spatial NC can be shared by a plurality of devices or arithmetic units. As a result, the amount of calculation of each device or arithmetic unit can be reduced, and the number of input/output lines required for one device can also be reduced.

In the following, a general noise canceling technique and the present technology will be described in more detail.
FIG. 1 is a diagram showing the configuration of a multi-input multi-output system that realizes noise canceling in general headphones or the like, that is, a parallel SISO (Single Input Single Output) system.

The parallel SISO system shown in FIG. 1 has microphones 11-1 to 11-6, SISO filters 12-1 to 12-6, and speakers 13-1 to 13-6.

The microphones 11-1 to 11-6 pick up ambient sound and supply the resulting microphone signals to the SISO filters 12-1 to 12-6.

The SISO filters 12-1 to 12-6 filter the microphone signals supplied from the microphones 11-1 to 11-6 with SISO filters in the time domain, and supply the resulting speaker signals to the speakers 13-1 to 13-6.

A speaker signal is a drive signal for causing a speaker to output sound (hereinafter also referred to as a noise canceling sound) so that the noise sound is canceled in the target region (position), that is, so that noise canceling is realized. In other words, the speaker signal is a noise canceling signal for causing the speaker to output the noise canceling sound.

The speakers 13-1 to 13-6 output sound based on the speaker signals supplied from the SISO filters 12-1 to 12-6, thereby realizing noise canceling.

Hereinafter, when it is not necessary to distinguish the microphones 11-1 to 11-6, they are also simply referred to as microphones 11, and when it is not necessary to distinguish the SISO filters 12-1 to 12-6, they are also simply referred to as SISO filters 12. Similarly, when it is not necessary to distinguish the speakers 13-1 to 13-6, they are also simply referred to as speakers 13.

In FIG. 1, for example, the system consisting of the microphone 11-1, the SISO filter 12-1, and the speaker 13-1 is a SISO system corresponding to one channel, and a plurality of such SISO systems are arranged in parallel to form the parallel SISO system.

Since each channel, that is, each SISO system, operates independently, the amount of calculation in each SISO system is small. However, in the parallel SISO system, the correlation between channels is not taken into consideration, so the higher the frequency, the greater the influence of the phase shift of the sound output from each speaker 13, and the lower the noise canceling effect.
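A minimal sketch of this parallel SISO structure (filter lengths and coefficients are placeholders, not taken from the document): each channel convolves its own microphone signal with its own FIR filter, independently of all other channels, so the cost grows only linearly with the number of channels.

```python
import numpy as np

def parallel_siso(mic_signals, siso_filters):
    """mic_signals: (C, n) time-domain signals; siso_filters: (C, L) FIR taps.
    Each channel is filtered independently; no cross-channel terms appear."""
    C, n = mic_signals.shape
    return np.stack([np.convolve(mic_signals[c], siso_filters[c])[:n] for c in range(C)])

# Hypothetical example: 6 channels, 128-tap filters.
mics = np.random.randn(6, 1024)
filters = np.random.randn(6, 128) * 0.01
speaker_signals = parallel_siso(mics, filters)   # cost grows linearly with C
```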
FIG. 2 is a diagram showing the configuration of a multipoint control MIMO (Multi Input Multi Output) system. In FIG. 2, the parts corresponding to those in FIG. 1 are designated by the same reference numerals, and their description will be omitted as appropriate.

The multipoint control MIMO system shown in FIG. 2 has microphones 11-1 to 11-6, a MIMO filter 41, and speakers 13-1 to 13-6.

In this multipoint control MIMO system, the microphone signals obtained by all the microphones 11 are input to one MIMO filter 41.

The MIMO filter 41 filters the microphone signals supplied from the microphones 11 to generate a speaker signal for each channel, and outputs the speaker signal of each channel to the speaker 13 corresponding to that channel.

In the MIMO filter 41, a time domain filter operation is performed between every microphone 11 and every speaker 13, so the correlation between channels is also taken into consideration, and a good noise canceling effect can be obtained even at high frequencies.

However, in the multipoint control MIMO system, the larger the number of channels, the larger the amount of calculation in the MIMO filter 41, which grows in proportion to the square of the number of channels.

For example, in the parallel SISO system of FIG. 1, if the number of channels is 48, only 48 channels of filtering are required. In contrast, in the multipoint control MIMO system of FIG. 2, if the number of channels is 48, 2304 (= 48^2) channels of filtering are required, and the amount of calculation increases significantly compared with a system with a small number of channels.
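For comparison, a minimal sketch of the multipoint control MIMO filtering (sizes and coefficients are again placeholders): every microphone-speaker pair has its own time domain FIR filter, so the number of convolutions grows with the square of the channel count.

```python
import numpy as np

def mimo_filter(mic_signals, filters):
    """mic_signals: (C, n); filters: (C_out, C_in, L) time-domain MIMO filters.
    Each output channel sums the filtered contributions of every input channel."""
    C_out, C_in, _ = filters.shape
    n = mic_signals.shape[1]
    out = np.zeros((C_out, n))
    for o in range(C_out):
        for i in range(C_in):                      # C_out * C_in convolutions in total
            out[o] += np.convolve(mic_signals[i], filters[o, i])[:n]
    return out

mics = np.random.randn(6, 1024)
w = np.random.randn(6, 6, 128) * 0.01              # 36 filters for 6 channels
speaker_signals = mimo_filter(mics, w)
```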
Therefore, it is known that, as shown in FIG. 3 for example, performing processing in the spatial frequency domain with a spatial frequency domain processing system can significantly reduce the amount of calculation. In FIG. 3, the parts corresponding to those in FIG. 1 are designated by the same reference numerals, and their description will be omitted as appropriate.

The spatial frequency domain processing system shown in FIG. 3 has microphones 11-1 to 11-6, time FFT (Fast Fourier Transform) units 71-1 to 71-6, a spatial FFT unit 72, a filter processing unit 73, a spatial inverse FFT unit 74, time inverse FFT units 75-1 to 75-6, and speakers 13-1 to 13-6.

The time FFT units 71-1 to 71-6 perform time FFT processing on the microphone signals supplied from the microphones 11-1 to 11-6, and supply the resulting time frequency domain signals to the spatial FFT unit 72.

The spatial FFT unit 72 performs spatial FFT processing on the signals supplied from the time FFT units 71-1 to 71-6, and supplies the resulting spatial frequency domain signals (a signal for each frequency bin) to the filter processing unit 73.

Hereinafter, when it is not necessary to distinguish the time FFT units 71-1 to 71-6, they are also simply referred to as time FFT units 71.

In the time FFT unit 71, FFT processing along the time axis (a temporal Fourier transform) such as an STFT (Short Time Fourier Transform) is performed as the time FFT processing, and the microphone signal, which is a time signal, is converted into a signal in the time frequency domain.

Further, in the spatial FFT unit 72, FFT processing along the spatial axis (a spatial Fourier transform) is performed as the spatial FFT processing on the time frequency domain signals obtained by the time FFT units 71, thereby obtaining time frequency signals in the spatial frequency domain.

The filter processing unit 73 filters the signals supplied from the spatial FFT unit 72 in the spatial frequency domain, and supplies the resulting speaker signal for each frequency bin to the spatial inverse FFT unit 74.

Since the spatial FFT unit 72 obtains time frequency signals in the spatial frequency domain by the spatial FFT processing, the filtering in the filter processing unit 73 becomes a multiplication along the frequency axis, which greatly reduces the amount of calculation compared with the multipoint control MIMO system of FIG. 2.

Such operations in the spatial frequency domain are often used for sound field reproduction by wave field synthesis and are described in detail in, for example, "Sascha Spors and Herbert Buchner, 'Efficient Massive Multichannel Active Noise Control using Wave-Domain Adaptive Filtering', 2008 3rd International Symposium on Communications, Control and Signal Processing." (hereinafter referred to as Reference 1).

The spatial inverse FFT unit 74 performs spatial inverse FFT processing, that is, the inverse transform of the spatial FFT processing, on the spatial frequency domain speaker signals supplied from the filter processing unit 73, and supplies the resulting time frequency domain speaker signals of each channel to the time inverse FFT units 75-1 to 75-6.

The time inverse FFT units 75-1 to 75-6 perform time inverse FFT processing, that is, the inverse transform of the time FFT processing, on the time frequency domain speaker signals supplied from the spatial inverse FFT unit 74, and supply the resulting time domain speaker signals of each channel to the speakers 13-1 to 13-6.

Hereinafter, when it is not necessary to distinguish the time inverse FFT units 75-1 to 75-6, they are also simply referred to as time inverse FFT units 75.

Such a spatial frequency domain processing system can be used when the microphones 11 and the speakers 13 are arranged in an array with a specific shape such as a ring, and can realize spatial NC with many channels and over a wide frequency band with a low amount of calculation.

That is, in the spatial frequency domain processing system, the correlation between channels is taken into consideration, so that higher spatial NC performance can be obtained up to higher frequencies than in the parallel SISO system, while keeping the amount of calculation lower than in the multipoint control MIMO system.

Further, in the spatial frequency domain processing system, a wide region can be controlled, that is, a desired wavefront can be formed with high accuracy over a wide region, so that spatial NC can be performed over a wide region. Furthermore, in the spatial frequency domain processing system, the adaptive processing of the filter used in the filter processing unit 73 converges quickly, so that spatial NC that follows environmental changes can be performed.
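A block-processing sketch of this pipeline is given below (it is only an illustration under assumed frame length, window handling, and filter values, not the system of FIG. 3 itself). It shows why the filtering reduces to a per-bin multiplication once both the time FFT and the spatial FFT have been applied, and also why the block-wise time FFT inherently costs at least one frame of latency.

```python
import numpy as np

def wave_domain_frame(mic_frame, W):
    """Process one frame of C microphone signals.
    mic_frame: (C, N) time samples; W: (C, N // 2 + 1) complex per-bin filter gains."""
    X = np.fft.rfft(mic_frame, axis=1)      # time FFT per channel (latency: one frame)
    Xs = np.fft.fft(X, axis=0)              # spatial FFT across channels
    Ys = W * Xs                             # filtering = element-wise multiplication
    Y = np.fft.ifft(Ys, axis=0)             # inverse spatial FFT
    return np.fft.irfft(Y, n=mic_frame.shape[1], axis=1)  # inverse time FFT

C, N = 6, 512
frame = np.random.randn(C, N)
W = (np.random.randn(C, N // 2 + 1) + 1j * np.random.randn(C, N // 2 + 1)) * 0.01
speaker_frame = wave_domain_frame(frame, W)
```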
Here, the filtering in the spatial frequency domain in the filter processing unit 73 will be described more specifically.

In the following, it is assumed that all processing is performed on a computer, and what should properly be written as a discrete Fourier transform (DFT (Discrete Fourier Transform)) will be simply referred to as a Fourier transform.

For example, consider a microphone array consisting of a plurality of microphones arranged annularly at equal intervals around the origin of an xy coordinate system, which is a predetermined two-dimensional orthogonal coordinate system set in space, and a speaker array consisting of a plurality of speakers arranged annularly at equal intervals around the origin of the xy coordinate system.

Here, it is assumed that the number of elements of the microphone array and of the speaker array, that is, the number of microphones constituting the microphone array and the number of speakers constituting the speaker array, is M in each case.

Further, the radius of the microphone array, that is, the distance from the center position (origin position) of the microphone array to a microphone, is denoted by Rmic, and similarly the radius of the speaker array is denoted by Rspc.

At this time, if the index (microphone index) m indicating a microphone constituting the microphone array is m = 0, 1, ..., M-1, the coordinates (x, y) indicating the arrangement position of each microphone constituting the microphone array are as shown in the following equation (1).
(x, y) = \left( R_{\mathrm{mic}} \cos\frac{2\pi m}{M},\ R_{\mathrm{mic}} \sin\frac{2\pi m}{M} \right) \qquad (1)
The coordinates indicating the arrangement position of each speaker constituting the speaker array can be expressed in the same manner as the coordinates indicating the arrangement position of each microphone.

Further, let n be the discrete time index (time index), and let x[m, n] be the time domain microphone signal obtained by the m-th microphone constituting the microphone array. Similarly, let y[m, n] be the time domain speaker signal of the m-th speaker constituting the speaker array.

If the time frequency domain microphone signal obtained by performing a Fourier transform in the time domain on the microphone signal x[m, n] is denoted by X[m, k], the relationship between the microphone signal x[m, n] and the microphone signal X[m, k] is as shown in the following equation (2).
X[m, k] = \sum_{n=0}^{N-1} x[m, n]\, e^{-j 2\pi k n / N} \qquad (2)
Note that k in the microphone signal X[m, k] is an index indicating the time frequency, and N in equation (2) indicates the temporal Fourier transform length.

If the time frequency domain speaker signal obtained by performing a Fourier transform in the time domain on the speaker signal y[m, n] is denoted by Y[m, k], the relationship between the speaker signal y[m, n] and the speaker signal Y[m, k] can also be expressed by an equation similar to equation (2).

The spatial Fourier transform is defined in the same way as such a temporal Fourier transform. That is, whereas in equation (2) the Fourier transform is performed with respect to the time index n, in the spatial Fourier transform the Fourier transform is performed with respect to the microphone index m.

For example, let l be the index indicating the spatial frequency, that is, the index of the frequency bin of the spatial frequency, and let X'[l, k] be the spatial frequency domain signal obtained by performing a spatial Fourier transform on the microphone signal X[m, k]. In this case, the relationship between the microphone signal X[m, k] and the signal X'[l, k] is as shown in the following equation (3).
X'[l, k] = \sum_{m=0}^{M-1} X[m, k]\, e^{-j 2\pi l m / M} \qquad (3)
If the spatial frequency domain speaker signal obtained by performing a spatial Fourier transform in the frequency domain on the speaker signal Y[m, k] is denoted by Y'[l, k], the relationship between the speaker signal Y[m, k] and the speaker signal Y'[l, k] can also be expressed by an equation similar to equation (3).

According to Reference 1 described above, the filtering in the spatial frequency domain is performed on the spatial frequency domain signal X'[l, k], and this filtering process is expressed by the following equation (4), where W'[l, k] denotes a filter in the spatial frequency domain. That is, the speaker signal Y'[l, k] can be obtained by calculating equation (4) based on the signal X'[l, k] and the filter W'[l, k].
Y'[l, k] = W'[l, k]\, X'[l, k] \qquad (4)
In the filter processing unit 73 of the spatial frequency domain processing system shown in FIG. 3, the filtering expressed by equation (4) is performed to generate the spatial frequency domain speaker signals.

However, it is difficult to use the filtering shown in equation (4) in a spatial NC system. The reason is that a temporal Fourier transform is required in order to perform the filtering shown in equation (4), and this temporal Fourier transform causes an unavoidable system delay due to block processing.

Specifically, for example, in the temporal Fourier transform shown in equation (2), if the temporal Fourier transform length N is 512 and a 512-sample temporal Fourier transform is performed, a delay of about 10 msec occurs, and this delay time corresponds to 3 m or more when converted into distance using the speed of sound. Therefore, even if one tries to realize spatial NC by the filtering shown in equation (4), it is actually difficult to obtain high spatial NC performance.
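As a rough check of these figures, assuming a 48 kHz sampling rate and a speed of sound of roughly 343 m/s (neither value is stated here):

512\ \text{samples} / 48\,000\ \text{Hz} \approx 10.7\ \text{ms}, \qquad 0.0107\ \text{s} \times 343\ \text{m/s} \approx 3.7\ \text{m}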
Therefore, consider transforming equation (4) so that processing equivalent to equation (4) can be realized without using the temporal Fourier transform.

For example, if the filter length of the spatial frequency domain filter W'[l, k] is Nf and the inverse temporal Fourier transform and the inverse spatial Fourier transform are applied to both sides of equation (4), the following equation (5) is obtained.
y[m, n] = \sum_{m'=0}^{M-1} \sum_{n'=0}^{N_f - 1} w\left[(m - m')_M,\, n'\right] x[m',\, n - n'] \qquad (5)
In equation (5), w[m', n'] represents the time domain filter for the microphone with microphone index m' corresponding to the filter W'[l, k], and (P)_Q represents P mod Q, that is, the remainder of P divided by Q.

Therefore, equation (5) expresses the relationship between the time domain filter w[m, n] and microphone signal x[m, n] and the time domain speaker signal y[m, n].

Filtering based on equation (5) is not affected by the system delay caused by the temporal Fourier transform and can therefore be used for spatial NC, but its amount of calculation is equivalent to that of the above-described multipoint control MIMO system, and a large amount of computation is required. The multipoint control MIMO system is described in detail in, for example, "C. Hansen, et al., 'Active Control of Noise and Vibration', CRC press, 2012." (hereinafter referred to as Reference 2).

In contrast, focusing on the commutativity of the multidimensional Fourier transform, consider applying only the inverse temporal Fourier transform to both sides of equation (4), without applying the inverse spatial Fourier transform.
Here, as shown in the following equation (6), the signal obtained by applying to the time domain microphone signal x[m, n] a spatial Fourier transform (DFT) whose DFT point length is the total number M of microphones, that is, a conversion into a signal in the spatial frequency domain (spatial frequency conversion), is defined as x'[l, n].
x'[l, n] = \sum_{m=0}^{M-1} x[m, n]\, e^{-j 2\pi l m / M} \qquad (6)
That is, in equation (6), a conversion process is performed in which only the spatial Fourier transform (spatial frequency conversion) is applied to the microphone signal x[m, n], without a temporal Fourier transform (time frequency conversion); in other words, the time domain microphone signal x[m, n] is converted into the frequency domain only in the spatial direction. Therefore, the signal x'[l, n] obtained by equation (6) can be said to be a time signal in the spatial frequency domain.

Similarly to the signal x'[l, n], the spatial frequency domain filter (a filter for each spatial frequency bin) obtained by applying only the spatial Fourier transform (spatial frequency conversion) to the time domain filter w[m, n], without a temporal Fourier transform (time frequency conversion), is defined as w'[l, n].

Further, the spatial frequency domain speaker signal (a speaker signal for each spatial frequency bin) obtained by applying only the spatial Fourier transform (spatial frequency conversion) to the time domain speaker signal y[m, n], without a temporal Fourier transform (time frequency conversion), is defined as y'[l, n]. More precisely, the speaker signal y'[l, n] is not a drive signal that drives a single speaker, but the noise canceling sound to be output from each speaker is calculated from this speaker signal y'[l, n].

In this case, by applying only the inverse temporal Fourier transform to both sides of equation (4), without applying the inverse spatial Fourier transform, the following equation (7) is obtained.
y'[l, n] = \sum_{n'=0}^{N_f - 1} w'[l, n']\, x'[l,\, n - n'] \qquad (7)
In equation (7), the speaker signal y'[l, n] is obtained by filtering the signal x'[l, n] with the filter w'[l, n] of filter length Nf, that is, by performing a convolution in the time direction between the filter w'[l, n] and the signal x'[l, n].

The filtering operation shown in equation (7) is processing in the spatial frequency domain, but it requires no convolution in the spatial direction; only a convolution in the time direction has to be performed, independently for each spatial frequency index (frequency bin) l. Therefore, the substantial amount of calculation for obtaining the speaker signal y'[l, n] is, apart from constant factors, equivalent to that of the above-described parallel SISO system.
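A sample-by-sample sketch of equations (6) and (7) follows (filter values and sizes are placeholders, and the inverse DFT normalization is assumed). For every incoming set of simultaneous microphone samples, one M-point spatial DFT is taken, each spatial frequency bin is filtered by its own FIR filter in the time direction, and an inverse spatial DFT yields the simultaneous speaker samples; no block-wise time FFT appears, so no frame of latency is introduced.

```python
import numpy as np

M, Nf = 16, 64
w_spatial = (np.random.randn(M, Nf) + 1j * np.random.randn(M, Nf)) * 0.01  # w'[l, n]
history = np.zeros((M, Nf), dtype=complex)   # last Nf values of x'[l, n] per bin

def process_sample(mic_sample):
    """mic_sample: length-M vector of simultaneous microphone samples x[m, n]."""
    global history
    x_spatial = np.fft.fft(mic_sample)                    # equation (6), one sample
    history = np.roll(history, 1, axis=1)
    history[:, 0] = x_spatial
    y_spatial = np.sum(w_spatial * history, axis=1)       # equation (7), per bin l
    return np.real(np.fft.ifft(y_spatial))                # speaker samples y[m, n]

speaker_out = process_sample(np.random.randn(M))
```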
In the following, a system that generates speaker signals as noise canceling signals by the spatial frequency conversion shown in equation (6) and the filtering shown in equation (7) will also be referred to in particular as a low-delay spatial frequency domain processing system.

According to the present technology, by realizing spatial NC with a low-delay spatial frequency domain processing system, the delay time occurring in the system can be reduced and higher spatial NC performance can be obtained.

That is, in the low-delay spatial frequency domain processing system to which the present technology is applied, the filtering operation is a convolution rather than the simple multiplication of the spatial frequency domain processing system, but since SISO filtering is sufficient, the amount of calculation is significantly smaller than that of the multipoint control MIMO system.

Moreover, in the low-delay spatial frequency domain processing system, no Fourier transform (FFT) on the time axis is needed, so the delay (latency) generated in the system becomes extremely small. In addition, although a spatial FFT (spatial frequency conversion) is performed in the low-delay spatial frequency domain processing system, only the outputs (microphone signals) of the microphones at the same time instant are transformed at once, so there is no need to buffer the microphone signals in time and almost no delay occurs.

Therefore, the low-delay spatial frequency domain processing system is a practical system with low delay and a small amount of calculation, and can realize higher-performance spatial NC.
<Configuration example of noise canceling device>
FIG. 4 is a diagram showing a configuration example of a noise canceling device that is an example of an embodiment of the low-delay spatial frequency domain processing system to which the present technology is applied.
 図4に示すノイズキャンセリング装置101は、マイクアレイ111、信号処理装置112、およびスピーカアレイ113を有している。 The noise canceling device 101 shown in FIG. 4 has a microphone array 111, a signal processing device 112, and a speaker array 113.
 マイクアレイ111は、マイク121-1乃至マイク121-16により構成され、それらのマイクを環状や矩形状などの所定の形状に並べることで得られる環状マイクアレイ等のマイクアレイである。 The microphone array 111 is a microphone array such as an annular microphone array obtained by arranging the microphones 121-1 to 121-16 in a predetermined shape such as an annular shape or a rectangular shape.
 マイク121-1乃至マイク121-16は、キャンセル対象のノイズ音を含む周囲の音を収音し、その結果得られたマイク信号を信号処理装置112に供給する。なお、以下、マイク121-1乃至マイク121-16を特に区別する必要のない場合、単にマイク121とも称することとする。 The microphones 121-1 to 121-16 collect ambient sounds including noise to be canceled, and supply the resulting microphone signal to the signal processing device 112. Hereinafter, when it is not necessary to distinguish between the microphones 121-1 and the microphones 121-16, they are simply referred to as the microphones 121.
 信号処理装置112は、例えば1または複数の演算器を有するパーソナルコンピュータ等からなり、マイクアレイ111から供給されたマイク信号に基づいて、空間NCのための時間領域のスピーカ信号を生成し、スピーカアレイ113に出力する。 The signal processing device 112 comprises, for example, a personal computer having one or more arithmetic units, and generates a speaker signal in the time domain for spatial NC based on the microphone signal supplied from the microphone array 111, and the speaker array. Output to 113.
 この時間領域のスピーカ信号は、空間NCのためのノイズキャンセル信号であって、スピーカアレイ113を構成するスピーカを駆動させて、ノイズキャンセル音を出力させるスピーカ駆動信号である。 The speaker signal in this time domain is a noise canceling signal for spatial NC, and is a speaker driving signal that drives the speakers constituting the speaker array 113 to output a noise canceling sound.
 信号処理装置112は、例えばDSP(Digital Signal Processor)やFPGA(Field Programmable Gate Array)などの1つの演算器からなる信号処理部131を有している。 The signal processing device 112 has a signal processing unit 131 including one arithmetic unit such as a DSP (Digital Signal Processor) or an FPGA (Field Programmable Gate Array).
 信号処理部131は、空間周波数変換部141、フィルタ処理部142-1乃至フィルタ処理部142-16、および空間周波数合成部143を有している。 The signal processing unit 131 has a spatial frequency conversion unit 141, a filter processing unit 142-1 to a filter processing unit 142-16, and a spatial frequency synthesis unit 143.
The spatial frequency conversion unit 141 performs spatial frequency conversion on the time-domain microphone signals, that is, the time signals, supplied from the microphones 121-1 to 121-16, and supplies the resulting spatial frequency domain signals to the filter processing units 142-1 to 142-16. In other words, the spatial frequency conversion unit 141 converts the time-domain microphone signals into the spatial frequency domain.
For example, in the spatial frequency conversion unit 141, the DFT shown in equation (6) is performed as the spatial frequency conversion on the basis of the microphone signals supplied from all the microphones 121.
In particular, in this example the total number M of microphones 121 is 16, and a signal x'[l,n] is calculated for each of the 16 spatial frequency bins l corresponding to the filter processing units 142-1 to 142-16.
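For illustration, a minimal Python sketch of this spatial frequency conversion is given below. It assumes that equation (6) is a standard M-point DFT taken across the microphone channels at a single sample time n; the exact normalization and index conventions of equation (6) may differ, so this is only a sketch under that assumption, not the definitive formula of the present description.

```python
import numpy as np

def spatial_dft(mic_samples):
    """Spatial frequency conversion of one time sample.

    mic_samples: array of shape (M,), the outputs of the M microphones
    at the same sample time n (here M = 16).
    Returns an array of shape (M,) whose element l is the assumed
    x'[l, n] = sum_m mic_samples[m] * exp(-2j*pi*l*m / M).
    """
    M = len(mic_samples)
    m = np.arange(M)
    l = m.reshape(-1, 1)                      # spatial frequency bin index
    dft_matrix = np.exp(-2j * np.pi * l * m / M)
    return dft_matrix @ mic_samples           # equivalent to np.fft.fft(mic_samples)

# Example: one snapshot of 16 microphone samples at time n (random stand-in).
x_n = np.random.randn(16)
x_prime_n = spatial_dft(x_n)                  # 16 spatial frequency bins l = 0..15
```

Because only one snapshot of simultaneous microphone samples enters the transform, no time-domain buffering is involved, which is what keeps the delay small.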
The filter processing units 142-1 to 142-16 perform signal processing in the spatial frequency domain on the signals supplied from the spatial frequency conversion unit 141, thereby generating spatial frequency domain speaker signals, and supply them to the spatial frequency synthesis unit 143.
That is, the filter processing units 142-1 to 142-16 hold SISO filters for spatial NC, and use those SISO filters to filter the spatial frequency domain signals from the spatial frequency conversion unit 141 as the signal processing in the spatial frequency domain. More specifically, the filtering by a SISO filter is performed as a process of convolving the filter coefficients constituting the SISO filter with the spatial frequency domain signal.
Specifically, for example, the filter processing units 142-1 to 142-16 hold the above-described filter w'[l,n] as the SISO filter, and the calculation shown in equation (7) is performed as the filtering to generate the speaker signal y'[l,n].
Hereinafter, when there is no particular need to distinguish the filter processing units 142-1 to 142-16 from one another, they are also simply referred to as the filter processing units 142.
For example, when the filtering (signal processing) shown in equation (7) is performed in the filter processing units 142, one filter processing unit 142 performs the filtering for one spatial frequency bin l.
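The per-bin SISO filtering can be pictured with the sketch below. It assumes, as a plausible reading of equation (7), that each filter processing unit holds an FIR filter w'[l,k] of some tap length for its own spatial frequency bin l and convolves it along the time axis with the sequence x'[l,n]; the tap length and coefficients here are placeholders, not values from the present description.

```python
import numpy as np

class SisoFilter:
    """One filter processing unit 142: an FIR filter for a single
    spatial frequency bin l, applied along the time axis."""

    def __init__(self, coeffs):
        self.coeffs = np.asarray(coeffs, dtype=complex)       # assumed w'[l, 0..K-1]
        self.state = np.zeros(len(self.coeffs) - 1, dtype=complex)

    def process(self, x_l_n):
        """Filter one new input sample x'[l, n] and return
        y'[l, n] = sum_k w'[l, k] * x'[l, n - k] (streaming FIR convolution)."""
        buf = np.concatenate(([x_l_n], self.state))            # [x[n], x[n-1], ...]
        y_l_n = np.dot(self.coeffs, buf)
        self.state = buf[:-1]                                  # shift the delay line
        return y_l_n

# Example: 16 filter processing units, one per spatial frequency bin,
# with hypothetical 8-tap filters (random coefficients just to show the shapes).
rng = np.random.default_rng(0)
units = [SisoFilter(rng.standard_normal(8)) for _ in range(16)]
x_prime_n = rng.standard_normal(16) + 1j * rng.standard_normal(16)   # x'[l, n] for one n
y_prime_n = np.array([u.process(x_prime_n[l]) for l, u in enumerate(units)])
```

Each unit operates on a single bin independently, which is why the filters can be simple single-input single-output (SISO) filters rather than a full multichannel filter matrix.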
The SISO filter held by each filter processing unit 142 is, for example, an FIR (Finite Impulse Response) filter generated in advance by LMS (Least Mean Squares) or the like on the basis of the shape of the microphone array 111, the total number of microphones 121, and so on.
In the filter processing units 142, SISO filters prepared in advance may be used continuously, or the SISO filters may be updated successively on the basis of, for example, microphone signals obtained by picking up sound with microphones installed at control points.
The spatial frequency synthesis unit 143 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 142, thereby generating a time-domain speaker signal for each speaker, and supplies them to the speaker array 113.
For example, in the spatial frequency synthesis unit 143, the inverse of the spatial frequency conversion performed by the spatial frequency conversion unit 141 is performed as the spatial frequency synthesis. Therefore, when the DFT (spatial Fourier transform) shown in equation (6) is performed in the spatial frequency conversion unit 141, an IDFT (Inverse Discrete Fourier Transform), that is, an inverse spatial Fourier transform corresponding to equation (6), is performed in the spatial frequency synthesis unit 143.
The speaker array 113 is made up of speakers 151-1 to 151-16, which are speaker units, and is a speaker array, such as an annular speaker array, obtained by arranging those speakers in a predetermined shape such as a ring or a rectangle.
The speakers 151-1 to 151-16 are driven on the basis of the time-domain speaker signals supplied from the spatial frequency synthesis unit 143 and output noise canceling sound. As a result, the noise sound is canceled in a predetermined target region, and spatial NC is realized.
Hereinafter, when there is no particular need to distinguish the speakers 151-1 to 151-16 from one another, they are also simply referred to as the speakers 151.
Here, an arrangement example of the microphone array 111 and the speaker array 113 will be described with reference to FIG. 5. In FIG. 5, parts corresponding to those in FIG. 4 are denoted by the same reference numerals, and their description is omitted. In addition, in FIG. 5, the reference numerals of the microphones 121 and the speakers 151 are omitted to keep the figure easy to read.
In the example of FIG. 5, a user U11, such as a listener who listens to content, is inside a predetermined region R11, and this region R11 is the region (cancellation area) targeted by the spatial NC.
The speakers 151 constituting the speaker array 113 are arranged in a ring surrounding the region R11, which is the cancellation area, to form an annular speaker array.
Further, the microphones 121 constituting the microphone array 111 are arranged in a ring outside the speaker array 113 so as to surround it, forming an annular microphone array.
Here, the speaker array 113 and the microphone array 111 are arranged so that their centers coincide with the center of the circular region R11.
In the noise canceling device 101, the microphone array 111, which is arranged outside the speaker array 113 as viewed from the region R11, picks up noise (noise sound) that is generated outside the microphone array 111 and propagates toward the region R11.
Speaker signals are then generated on the basis of the microphone signals obtained by the sound pickup, and noise canceling sound based on the speaker signals is output toward the region R11 from each speaker 151 constituting the speaker array 113. The wavefronts of the noise canceling sound output from the speakers 151 are combined to form, within the region R11, a wavefront that cancels out the noise sound. In this way, spatial NC by wave field synthesis is realized.
Here, an example has been described in which the number of microphones 121 constituting the microphone array 111 is the same as the number of speakers 151 constituting the speaker array 113, and the microphone array 111 and the speaker array 113 have the same (annular) shape.
However, the numbers of microphones 121 and speakers 151, and the shapes of the microphone array 111 and the speaker array 113, do not necessarily have to be the same, and different numbers and shapes may be used. For example, when the numbers of microphones 121 and speakers 151 differ, the spatial frequency conversion unit 141 or the spatial frequency synthesis unit 143 may upsample or downsample the spatial frequency domain signal according to those numbers.
In addition, any number of microphones 121 and speakers 151 may be used, and the microphone array 111 and the speaker array 113 may have any shape (array shape).
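When the numbers of microphones and speakers differ, one simple way to resample in the spatial frequency domain, as mentioned above, is to zero-pad or truncate the spatial spectrum before the synthesis. The sketch below illustrates that idea for converting an M-bin spectrum into an N-bin spectrum; it is one possible interpretation of the up/downsampling, with an arbitrary scaling convention, and is not a method prescribed by the present description.

```python
import numpy as np

def resample_spatial_spectrum(x_prime, num_out):
    """Map an M-bin spatial spectrum (one bin per microphone) onto a
    num_out-bin spectrum (one bin per speaker) by keeping the lowest
    spatial frequencies and zero-padding or discarding the rest."""
    M = len(x_prime)
    out = np.zeros(num_out, dtype=complex)
    half = min(M, num_out) // 2
    out[:half] = x_prime[:half]            # low non-negative spatial frequencies
    if half:
        out[-half:] = x_prime[-half:]      # corresponding negative spatial frequencies
    return out * (num_out / M)             # one possible amplitude convention

# Example: 16 microphone bins mapped onto 24 speaker bins (spatial upsampling).
spec_mics = np.fft.fft(np.random.randn(16))
spec_spks = resample_spatial_spectrum(spec_mics, 24)
speaker_samples = np.fft.ifft(spec_spks).real   # take the real part after synthesis
```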
<Explanation of noise canceling processing>
Next, the operation of the noise canceling device 101 will be described. That is, the noise canceling processing performed by the noise canceling device 101 will be described below with reference to the flowchart of FIG. 6.
In step S11, each microphone 121 of the microphone array 111 picks up ambient sound and supplies the resulting time-domain microphone signal to the spatial frequency conversion unit 141.
In step S12, the spatial frequency conversion unit 141 performs spatial frequency conversion on the time-domain microphone signals supplied from the microphones 121 and supplies the resulting spatial frequency domain signals to each filter processing unit 142. For example, in step S12, the calculation of equation (6) described above is performed to generate the spatial frequency domain signals.
In step S13, each filter processing unit 142 filters the spatial frequency domain signal supplied from the spatial frequency conversion unit 141 with the SISO filter it holds, and supplies the resulting spatial frequency domain speaker signal to the spatial frequency synthesis unit 143. For example, in step S13, the calculation of equation (7) is performed as the filtering.
In step S14, the spatial frequency synthesis unit 143 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 142 and generates time-domain speaker signals.
In step S15, the spatial frequency synthesis unit 143 supplies the speaker signals obtained in the processing of step S14 to the speakers 151 of the speaker array 113 and causes them to output sound (noise canceling sound).
As a result, a wavefront that cancels the noise sound is formed in the cancellation area, and spatial NC is realized. When spatial NC has been performed in this way, the noise canceling processing ends.
As described above, the noise canceling device 101 performs spatial frequency conversion on the time-domain microphone signals without performing time-frequency conversion, and generates the speaker signals on the basis of the resulting spatial frequency domain signals.
By generating the speaker signals through signal processing in the spatial frequency domain in this way, without performing time-frequency conversion and its inverse, not only can the amount of computation be greatly reduced, but the delay time can also be reduced, and high spatial NC performance can be obtained in real time.
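Putting the pieces together, one way to picture the whole per-sample flow of FIG. 6 (pick up, spatial DFT, per-bin filtering, inverse spatial DFT, playback) is the Python sketch below. It processes one snapshot of microphone samples at a time with no time-domain FFT or buffering; the single-tap gains used as "SISO filters" are placeholders standing in for the real filters w'[l,n] of equation (7), not filters taken from the present description.

```python
import numpy as np

M = 16                                   # microphones = speakers = DFT point length

# Placeholder per-bin filters: a single complex gain per spatial frequency bin;
# the actual system would use FIR filters designed in advance by LMS or the like.
filter_gains = np.ones(M, dtype=complex) * -1.0

def process_snapshot(mic_samples):
    """Steps S12-S14 for one sample time: spatial DFT of the M microphone
    samples, per-bin filtering, inverse spatial DFT to M speaker samples."""
    x_prime = np.fft.fft(mic_samples)            # step S12: spatial frequency conversion
    y_prime = filter_gains * x_prime             # step S13: per-bin SISO filtering
    speaker_samples = np.fft.ifft(y_prime).real  # step S14: spatial frequency synthesis
    return speaker_samples

def run(mic_stream):
    """Steps S11 and S15: for each incoming snapshot of M microphone
    samples, emit M speaker samples with no time-domain buffering."""
    for mic_samples in mic_stream:               # step S11: sound pickup
        yield process_snapshot(mic_samples)      # step S15: drive the speakers

# Example with random snapshots in place of real microphone capture.
stream = (np.random.randn(M) for _ in range(5))
for spk in run(stream):
    pass  # each spk is the 16-channel speaker output for one sample time
```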
<Second embodiment>
<Processing in the spatial frequency domain>
In the noise canceling device 101 shown in FIG. 4, in order for the signal processing unit 131 to perform signal processing in the spatial frequency domain, the spatial frequency conversion unit 141 first performs spatial frequency conversion, converting the time-domain microphone signals into spatial frequency domain signals.
At this time, the outputs (microphone signals) of all the microphones 121 constituting the microphone array 111 must be input to the signal processing unit 131 for the spatial frequency conversion (DFT). Also, even after the filtering in the spatial frequency domain, the spatial frequency synthesis must be performed using the outputs of all the filter processing units 142.
Therefore, in the noise canceling device 101, the spatial frequency conversion, the signal processing (filtering) in the spatial frequency domain, and the spatial frequency synthesis must all be performed by the single arithmetic unit serving as the signal processing unit 131. That is, the hardware that performs these processes cannot be divided so that the processing is shared (distributed) among a plurality of pieces of hardware (arithmetic units).
In FIG. 5, for simplicity of explanation, an example of a 16-point FFT was described, that is, an example in which the DFT point length M (the total number M of microphones 121) in equation (6) is 16.
In spatial NC, the narrower the spacing between the microphones 121 and between the speakers 151 arranged in the array, the higher the frequencies that can be controlled, that is, the higher the frequencies up to which the noise sound can be canceled.
Also, if the region R11 serving as the cancellation area is to be made wider, more microphones 121 and speakers 151 are needed than when control up to the same frequency is performed over a narrower region R11.
As a specific example, if the upper limit of the frequency targeted for noise canceling is 1 kHz and the cancellation area (region R11) is to be a region 2 m in diameter, 40 or more microphones 121 and speakers 151 are required to obtain sufficient spatial NC performance. Therefore, when spatial NC targeting an even higher frequency or a wider region is attempted, the number of required microphones 121 and speakers 151 may exceed several hundred.
In such a case, inputting the outputs of all the microphones 121 to a single arithmetic unit (arithmetic device) such as a DSP or an FPGA serving as the signal processing unit 131 may be physically impossible because of the limit on the number of input and output PINs (the number of input terminals and output terminals) provided on the arithmetic unit.
Even if the number of PINs were sufficient, as the number of microphones 121 and speakers 151 increases, the number of required filter processing units 142 (SISO filters) increases accordingly, and the amount of computation of the signal processing unit 131 as a whole increases.
The amount of computation may then become too large for a single signal processing unit 131 to handle, and spatial NC may no longer be realizable.
Therefore, for example, the plurality of microphones 121 and speakers 151 constituting the microphone array 111 and the speaker array 113 may be divided into a plurality of groups, and the spatial frequency conversion, the filtering in the spatial frequency domain, and the spatial frequency synthesis may be performed for each of the divided groups.
By doing so, the computation for spatial NC can be shared among a plurality of arithmetic units (arithmetic devices), so high spatial NC performance can be obtained while reducing the number of PINs and the amount of computation required per arithmetic unit.
For example, as shown in FIG. 7, consider a case where a noise sound is generated with a position P11 outside the microphone array 111 as the sound source position of a single point sound source. In FIG. 7, parts corresponding to those in FIG. 5 are denoted by the same reference numerals, and their description is omitted as appropriate.
In this case, in the noise canceling device 101 described above, all the microphones 121 constituting the microphone array 111 and all the speakers 151 constituting the speaker array 113 are used to perform spatial NC.
However, when performing spatial NC, it is not always necessary to use all the microphones 121 and speakers 151, for example when there is only one noise source.
For example, when the noise of a single point sound source is considered as in the example of FIG. 7, the degree to which each microphone 121 and each speaker 151 contributes to the sound pickup and sound output (the contribution rate), that is, its importance for realizing spatial NC, differs.
Specifically, in the example of FIG. 7, the microphones 121 and speakers 151 arranged closer to the position P11 where the noise source is located are more important, and conversely, the microphones 121 and speakers 151 arranged farther from the position P11 are less important.
Therefore, in some cases spatial NC can be performed with sufficient performance without necessarily using all the microphones 121 and speakers 151.
In this example, sufficient spatial NC performance can be obtained even if only the microphones 121 and speakers 151 arranged within the range indicated by the arrow Q11, which is close to the position P11, are used.
Therefore, for example, as shown in FIG. 8, spatial NC can also be performed using only the four microphones 121 and twelve speakers 151 close to the position P11 where the noise source is located. In FIG. 8, parts corresponding to those in FIG. 7 are denoted by the same reference numerals, and their description is omitted as appropriate.
In the example of FIG. 8, four microphones 121 are used for the spatial NC, whereas twelve speakers 151 are used. Therefore, to generate the speaker signals for spatial NC by the same computation as in the noise canceling device 101, microphone signals from eight more microphones 121 would be needed.
Therefore, if zero signals, which are not actually obtained by sound pickup, are used as the microphone signals of the remaining eight microphones 121, the speaker signals of the twelve speakers 151 can be generated by the same computation as in the noise canceling device 101.
In this way, the noise sound generated at the position P11 can be sufficiently canceled without using all the microphones 121 and speakers 151.
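The use of zero signals as dummy microphone inputs can be sketched as follows: the four real microphone signals are placed alongside eight zeros so that the same 12-channel processing can be reused unchanged. Placing the real signals at the central channels is an assumption for illustration, chosen to match the input-terminal arrangement described for the second embodiment below.

```python
import numpy as np

def pad_with_dummy_zeros(real_mic_samples, total_channels=12):
    """Build a total_channels-long input vector from a few real microphone
    samples, filling the remaining channels with zero 'dummy' signals."""
    n_real = len(real_mic_samples)
    padded = np.zeros(total_channels)
    start = (total_channels - n_real) // 2       # assumed: real mics in the middle
    padded[start:start + n_real] = real_mic_samples
    return padded

# Example: 4 real microphone samples plus 8 zero signals -> a 12-channel input
# that can go through the same 12-point spatial DFT / filter / IDFT chain.
mic4 = np.random.randn(4)
x12 = pad_with_dummy_zeros(mic4)                 # shape (12,)
```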
However, in this case, the position where the noise sound is generated (the position of the noise source) must be in the vicinity of the position P11, and noise sound arriving from all directions cannot be handled.
However, if, for example, the microphones 121 constituting the microphone array 111 and the speakers 151 constituting the speaker array 113 are each divided into four groups and speaker signals are generated for each group as shown in FIG. 9, noise sound from all directions can be handled. In FIG. 9, parts corresponding to those in FIG. 7 are denoted by the same reference numerals, and their description is omitted as appropriate.
In this example, the 16 microphones 121 constituting the microphone array 111 and the 16 speakers 151 constituting the speaker array 113 are each divided into four groups. In the following, a group of microphones 121 is also referred to in particular as a microphone group, and a group of speakers 151 as a speaker group.
In this example, the 16 microphones 121 constituting the microphone array 111 are divided into four microphone groups as indicated by the arrows Q21 to Q24.
In particular, here the microphone groups are formed so that each microphone 121 belongs to only one microphone group, and each microphone group consists of microphones 121 arranged adjacent to one another. In this example, one microphone group consists of four microphones 121.
Specifically, as indicated by the arrow Q21, one microphone group is formed by the four mutually adjacent microphones 121 located at the front right as viewed from the user U11. Similarly, as indicated by the arrows Q22 to Q24, microphone groups are formed by the four mutually adjacent microphones 121 located at the rear right, the rear left, and the front left, respectively, as viewed from the user U11.
For these four microphone groups, four corresponding speaker groups are provided.
That is, as indicated by the arrow Q21, for the microphone group at the front right as viewed from the user U11, a speaker group is formed that consists of twelve speakers 151 arranged adjacent to one another and centered on the position at the front right of the user U11.
Similarly, as indicated by the arrows Q22 to Q24, for the microphone groups at the rear right, the rear left, and the front left as viewed from the user U11, speaker groups are formed, each consisting of twelve speakers 151 arranged adjacent to one another and centered on the position in the corresponding direction at the rear right, the rear left, and the front left of the user U11.
In this example, the speaker groups are formed so that each speaker group contains twelve mutually adjacent speakers 151, so each speaker 151 belongs to three speaker groups.
If such grouping is performed and, for each group, the microphone signals of the microphones 121 belonging to one microphone group are used to generate the speaker signals of the speakers 151 belonging to the speaker group corresponding to that microphone group, the overall spatial NC processing can be divided into four parts. That is, the hardware can be divided into four, for example by providing one arithmetic unit corresponding to the signal processing unit 131 for each corresponding pair of microphone group and speaker group, and the processing for spatial NC can be distributed over a plurality of arithmetic units.
In this example, the plurality of microphones 121 constituting the microphone array 111 is divided into four microphone groups so that four mutually adjacent microphones 121 belong to the same microphone group. That is, the microphones 121 used for a single filtering operation are selected while shifting by four microphones at a time.
Similarly, the speakers 151 to which the speaker signals obtained by a single filtering operation are output are selected while shifting by four speakers at a time, and the speaker signals are supplied to the selected speakers 151.
At this time, the same speaker 151 is selected as the output destination of speaker signals obtained by three different filtering operations, so the three speaker signals destined for the same speaker 151 are added to obtain the final speaker signal.
As described above, by performing grouping so that the overall processing is shared part by part, and by adding the speaker signals where the filtering output destinations overlap to obtain the final speaker signals, spatial NC can as a result be performed using all the microphones 121 and speakers 151. This makes it possible to handle noise sound from all directions.
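To make this grouping concrete, the sketch below builds the index sets for a 16-microphone, 16-speaker ring: four disjoint microphone groups of four, each paired with a 12-speaker group centered on the same direction, so that every speaker ends up in exactly three speaker groups and its three overlapping partial outputs are later summed. The 0-based starting indices are an assumption for illustration, chosen to be consistent with the wiring described for FIG. 10 below.

```python
M = 16                                  # microphones and speakers on the ring
GROUPS = 4                              # number of microphone/speaker groups
MICS_PER_GROUP = M // GROUPS            # 4 adjacent microphones per group
SPKS_PER_GROUP = 12                     # 12 adjacent speakers per group

def ring(start, count, size=M):
    """Indices of `count` adjacent elements on a ring of `size`, from `start`."""
    return [(start + i) % size for i in range(count)]

# Microphone group g: microphones 4g .. 4g+3 (disjoint groups).
mic_groups = [ring(g * MICS_PER_GROUP, MICS_PER_GROUP) for g in range(GROUPS)]

# Speaker group g: 12 speakers centered on the same direction as mic group g,
# i.e. the group's 4 facing speakers plus 4 on each side (assumed centering).
spk_groups = [ring(g * MICS_PER_GROUP - 4, SPKS_PER_GROUP) for g in range(GROUPS)]

# Every speaker appears in exactly 3 of the 4 overlapping speaker groups,
# so its final drive signal is the sum of 3 partial speaker signals.
membership = {s: sum(s in grp for grp in spk_groups) for s in range(M)}
assert all(count == 3 for count in membership.values())
```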
<Configuration example of noise canceling device>
When the microphones 121 and the speakers 151 are divided into groups and the processing is shared among those groups, the noise canceling device is configured, for example, as shown in FIG. 10. In FIG. 10, parts corresponding to those in FIG. 4 are denoted by the same reference numerals, and their description is omitted as appropriate.
The noise canceling device 191 shown in FIG. 10 includes the microphone array 111, a signal processing device 201, and the speaker array 113.
In this example, the microphone array 111 and the speaker array 113 are each divided into four groups.
That is, the microphones 121-1 to 121-4 form one group.
Similarly, the microphones 121-5 to 121-8, the microphones 121-9 to 121-12, and the microphones 121-13 to 121-16 each form one group.
Also, the speakers 151-1 to 151-12 form one group, and the speakers 151-5 to 151-16 also form one group.
Similarly, the speakers 151-9 to 151-16 together with the speakers 151-1 to 151-4 form one group, and the speakers 151-13 to 151-16 together with the speakers 151-1 to 151-8 form one group.
The signal processing device 201 corresponds to the signal processing device 112 of FIG. 4 and is, for example, a personal computer having one or more arithmetic units.
The signal processing device 201 has signal processing units 211-1 to 211-4 and addition units 212-1 to 212-16.
Each of the signal processing units 211-1 to 211-4 consists of one arithmetic unit such as a DSP or an FPGA and corresponds to the signal processing unit 131 of FIG. 4.
The signal processing unit 211-1 performs the same processing as the signal processing unit 131 on the basis of the microphone signals supplied from the microphones 121-1 to 121-4 and eight predetermined zero signals treated as microphone signals, and generates speaker signals.
The signal processing unit 211-1 generates speaker signals for twelve channels, that is, speaker signals destined for the speakers 151-13 to 151-16 and the speakers 151-1 to 151-8.
The signal processing unit 211-1 supplies the generated speaker signals to the addition units of the corresponding channels, that is, to the addition units 212-13 to 212-16 and the addition units 212-1 to 212-8.
The signal processing units 211-2 to 211-4, like the signal processing unit 211-1, each generate and output speaker signals for twelve channels on the basis of the microphone signals from four microphones 121 and eight zero signals.
That is, the signal processing unit 211-2 receives microphone signals from the microphones 121-5 to 121-8 and supplies speaker signals to the addition units 212-1 to 212-12.
The signal processing unit 211-3 receives microphone signals from the microphones 121-9 to 121-12 and supplies speaker signals to the addition units 212-5 to 212-16.
The signal processing unit 211-4 receives microphone signals from the microphones 121-13 to 121-16 and supplies speaker signals to the addition units 212-9 to 212-16 and the addition units 212-1 to 212-4.
Hereinafter, when there is no particular need to distinguish the signal processing units 211-1 to 211-4 from one another, they are also simply referred to as the signal processing units 211.
In the noise canceling device 191, for each microphone group, the signal processing unit 211 to which the microphone signals obtained by sound pickup by the microphones 121 of that microphone group are input is determined in advance.
Each signal processing unit 211 performs filtering by the SISO filters and the like on the basis of the microphone signals supplied from all the microphones 121 belonging to one microphone group, and generates the speaker signals for some of the speakers 151 of the speaker array 113, that is, for the speakers 151 belonging to the speaker group corresponding to that microphone group.
The addition units 212-1 to 212-16 add the speaker signals of the same channel supplied from the plurality of signal processing units 211 to obtain the final speaker signals, and supply each final speaker signal to the speaker 151 of the corresponding channel.
In this example, the addition units 212-1 to 212-4 receive speaker signals from the signal processing unit 211-1, the signal processing unit 211-2, and the signal processing unit 211-4, and the addition units 212-5 to 212-8 receive speaker signals from the signal processing units 211-1 to 211-3.
The addition units 212-9 to 212-12 receive speaker signals from the signal processing units 211-2 to 211-4, and the addition units 212-13 to 212-16 receive speaker signals from the signal processing unit 211-1, the signal processing unit 211-3, and the signal processing unit 211-4.
Hereinafter, when there is no particular need to distinguish the addition units 212-1 to 212-16 from one another, they are also simply referred to as the addition units 212.
In this example, one corresponding addition unit 212 is provided for each of the plurality of speakers 151 constituting the speaker array 113, and each addition unit 212 adds and outputs the speaker signals for the same speaker 151 obtained by two or more signal processing units 211.
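The role of the addition units 212 can be sketched as below: each of the four signal processing units produces a 12-channel partial output destined for a particular window of speakers, and each speaker's final signal is the sum of the three partial signals addressed to it. The speaker windows follow the FIG. 10 wiring described above (for example, the unit 211-1 feeds the speakers 151-13 to 151-16 and 151-1 to 151-8), converted to 0-based indices for the code; the signal values themselves are random stand-ins.

```python
import numpy as np

NUM_SPEAKERS = 16

# 0-based speaker indices driven by each signal processing unit 211-1..211-4,
# taken from the FIG. 10 wiring (e.g. 211-1 -> speakers 151-13..16 and 151-1..8).
speaker_targets = [
    [12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7],    # signal processing unit 211-1
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],      # signal processing unit 211-2
    [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],  # signal processing unit 211-3
    [8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3],  # signal processing unit 211-4
]

def add_partial_outputs(partial_outputs):
    """Addition units 212: sum the 12-channel partial speaker signals from
    the four signal processing units into one 16-channel speaker signal."""
    final = np.zeros(NUM_SPEAKERS)
    for targets, partial in zip(speaker_targets, partial_outputs):
        for channel, speaker in enumerate(targets):
            final[speaker] += partial[channel]   # three partials land on each speaker
    return final

# Example: four dummy 12-channel partial outputs for one sample time.
partials = [np.random.randn(12) for _ in range(4)]
speaker_signal = add_partial_outputs(partials)   # shape (16,)
```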
Also, although an example in which the plurality of signal processing units 211 is provided in a single signal processing device 201 has been described here, these signal processing units 211 may each be provided in a plurality of mutually different signal processing devices.
<Configuration example of signal processing unit>
FIG. 11 is a diagram showing a configuration example of the signal processing unit 211 of the noise canceling device 191.
The signal processing unit 211 has a spatial frequency conversion unit 241, filter processing units 242-1 to 242-12, and a spatial frequency synthesis unit 243.
The spatial frequency conversion unit 241, the filter processing units 242-1 to 242-12, and the spatial frequency synthesis unit 243 correspond to the spatial frequency conversion unit 141, the filter processing units 142, and the spatial frequency synthesis unit 143 shown in FIG. 4.
The spatial frequency conversion unit 241 performs spatial frequency conversion on the basis of the time-domain microphone signals supplied from the four microphones 121 and the eight zero signals supplied as dummy microphone signals.
For example, in the spatial frequency conversion unit 241, a DFT similar to equation (6) is performed as the spatial frequency conversion. In this case, the DFT point length is 12, and the signal x'[l,n] is calculated for each of the twelve spatial frequency bins l corresponding to the filter processing units 242-1 to 242-12.
The spatial frequency conversion unit 241 supplies the spatial frequency domain signals obtained by the spatial frequency conversion to the filter processing units 242-1 to 242-12.
The filter processing units 242-1 to 242-12 perform signal processing in the spatial frequency domain on the signals supplied from the spatial frequency conversion unit 241, thereby generating spatial frequency domain speaker signals, and supply them to the spatial frequency synthesis unit 243.
Specifically, the filter processing units 242-1 to 242-12 hold SISO filters for spatial NC.
The filter processing units 242-1 to 242-12 generate the speaker signals by using the SISO filters they hold to filter the spatial frequency domain signals from the spatial frequency conversion unit 241 as the signal processing. The SISO filter is, for example, the filter w'[l,n] described above, and the calculation of equation (7) is performed as the filtering.
Hereinafter, when there is no particular need to distinguish the filter processing units 242-1 to 242-12 from one another, they are also simply referred to as the filter processing units 242.
The spatial frequency synthesis unit 243 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 242, thereby generating a time-domain speaker signal for each speaker 151, and supplies them to the speaker array 113.
In the spatial frequency synthesis unit 243, the inverse of the spatial frequency conversion performed by the spatial frequency conversion unit 241 is performed as the spatial frequency synthesis.
As described above, in the noise canceling device 191, the outputs of the plurality of microphones 121 constituting the microphone array 111 are divided into sets of four and input to the respective signal processing units 211.
In each signal processing unit 211, the microphone signals supplied from the four microphones 121 are input to the four input terminals at the center of the twelve input terminals. Zero signals, which are dummy microphone signals, are input to the remaining eight input terminals, four on each side of those four central input terminals.
In the spatial frequency conversion unit 241, a DFT with a DFT point length of, for example, 12 is performed as the spatial frequency conversion on the basis of the microphone signals input from the input terminals, and then, in each filter processing unit 242, the output of the DFT is filtered by the SISO filter.
Further, in the spatial frequency synthesis unit 243, for example an IDFT is performed as the spatial frequency synthesis on the outputs of the filter processing units 242, and the time-domain speaker signal of each channel is generated.
The speaker signal of each channel generated in this way is input to the speaker 151 corresponding to that channel, but before that, in each addition unit 212, the speaker signals of the same channel from three mutually adjacent signal processing units 211 are added.
An amplifier (not shown) is provided in front of each speaker 151; the addition of the speaker signals of the same channel may be performed inside the amplifier, or the addition may be performed in a digital or analog state before the speaker signals are input to the amplifier.
In each signal processing unit 211, the input/output of the spatial frequency conversion unit 241 and the spatial frequency synthesis unit 243, that is, the input/output (point length) of the DFT and IDFT, is 12, which is smaller than the point length of 16 in the case of the signal processing unit 131 shown in FIG. 4.
Therefore, the number of PINs (the number of input/output terminals) of the signal processing unit 211 can be made smaller than in the case of the signal processing unit 131, and the amount of computation (signal processing) performed by the signal processing unit 211 can also be reduced.
Thus, according to the noise canceling device 191, the number of PINs and the amount of computation of each signal processing unit 211 can be reduced, the delay time can also be reduced, and high spatial NC performance can be obtained in real time. Moreover, noise sound from all directions can be handled.
Note that in FIG. 10, for simplicity of explanation, an example of reducing the point length from 16 to 12 has been described.
However, the point length (number of divisions) in each signal processing unit can be set arbitrarily according to the specifications of the signal processing unit (arithmetic unit), such as the number of PINs and the number of MIPS (Million Instructions Per Second), for example by dividing 256 channels of microphone signals into blocks of 12 channels each for processing.
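As a rough illustration of that sizing, the snippet below counts the arithmetic units needed under the simplest reading of the example, namely that the microphone channels are split into non-overlapping blocks of 12, with the last block zero-padded if it is not full. The exact splitting scheme for such a large array is not specified here, so this is only an assumption for illustration.

```python
import math

total_mic_channels = 256          # example figure from the text
channels_per_unit = 12            # point length chosen per arithmetic unit

# Number of signal processing units when the 256 microphone channels are
# split into blocks of 12 (the final block would be zero-padded if not full).
num_units = math.ceil(total_mic_channels / channels_per_unit)
print(num_units)                  # 22 units, each needing only 12 input PINs
```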
<Explanation of noise canceling processing>
Next, the operation of the noise canceling device 191 will be described. That is, the noise canceling processing performed by the noise canceling device 191 will be described below with reference to the flowchart of FIG. 12.
Since the processing of step S51 is the same as the processing of step S11 of FIG. 6, its description is omitted. However, in step S51, the microphone signal obtained by each microphone 121 is supplied to the spatial frequency conversion unit 241 of the corresponding signal processing unit 211.
In step S52, the spatial frequency conversion unit 241 of each signal processing unit 211 performs spatial frequency conversion on the time-domain microphone signals supplied from the four microphones 121 and the eight zero signals, and supplies the resulting spatial frequency domain signals to each filter processing unit 242. For example, in step S52, a calculation similar to equation (6) described above is performed.
In step S53, each filter processing unit 242 filters the spatial frequency domain signal supplied from the spatial frequency conversion unit 241 with the SISO filter it holds, and supplies the resulting spatial frequency domain speaker signal to the spatial frequency synthesis unit 243. For example, in step S53, a calculation similar to equation (7) is performed as the filtering.
In step S54, the spatial frequency synthesis unit 243 performs spatial frequency synthesis on the spatial frequency domain speaker signals supplied from the filter processing units 242 and supplies the resulting time-domain speaker signals to the addition units 212.
In step S55, each addition unit 212 performs addition processing to add the speaker signals of the same channel supplied from the spatial frequency synthesis units 243 of the three signal processing units 211, and obtains the final speaker signal.
In step S56, each addition unit 212 supplies the speaker signal obtained in the processing of step S55 to the corresponding speaker 151 of the speaker array 113 and causes it to output sound (noise canceling sound), and the noise canceling processing ends.
As described above, the noise canceling device 191 divides the output of the microphone array 111 into four parts, inputs them to the signal processing units 211, and generates the speaker signals through signal processing in the spatial frequency domain in each signal processing unit 211. In this way, the number of PINs and the amount of computation of each signal processing unit 211 can be reduced, the delay time can also be reduced, and high-performance spatial NC that can handle all directions can be realized in real time.
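A compact end-to-end sketch of this FIG. 12 flow is given below: each of the four signal processing units takes its four real microphone samples plus eight zero signals, runs a 12-point spatial DFT, per-bin filtering, and IDFT, and the addition step then sums the three overlapping partial outputs per speaker. The per-bin gains and the placement of the real inputs at the central channels are placeholders consistent with the description above, not values taken from it.

```python
import numpy as np

NUM_MICS, NUM_SPEAKERS, POINT_LEN, GROUPS = 16, 16, 12, 4

# Per-unit, per-bin placeholder gains standing in for the SISO filters.
gains = [np.ones(POINT_LEN, dtype=complex) * -1.0 for _ in range(GROUPS)]

mic_groups = [[(4 * g + i) % NUM_MICS for i in range(4)] for g in range(GROUPS)]
spk_groups = [[(4 * g - 4 + i) % NUM_SPEAKERS for i in range(POINT_LEN)]
              for g in range(GROUPS)]    # 12 speakers centered on each mic group

def process_unit(g, mic_samples):
    """One signal processing unit 211: steps S52-S54 for one sample time."""
    x12 = np.zeros(POINT_LEN)
    x12[4:8] = mic_samples[mic_groups[g]]        # 4 real mics at the central inputs
    x_prime = np.fft.fft(x12)                    # step S52: 12-point spatial DFT
    y_prime = gains[g] * x_prime                 # step S53: per-bin SISO filtering
    return np.fft.ifft(y_prime).real             # step S54: 12-channel partial output

def process_snapshot(mic_samples):
    """Steps S51-S56 for one sample time of all 16 microphones."""
    final = np.zeros(NUM_SPEAKERS)
    for g in range(GROUPS):
        partial = process_unit(g, mic_samples)
        for ch, spk in enumerate(spk_groups[g]):
            final[spk] += partial[ch]            # step S55: addition units 212
    return final                                 # step S56: drive the 16 speakers

speaker_out = process_snapshot(np.random.randn(NUM_MICS))   # one sample time
```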
<Third embodiment>
<Another example of grouping>
In the noise canceling device 191, the addition units 212 are provided after the signal processing units 211 so that the processing can be shared among the plurality of signal processing units 211.
However, if, for example, the number of microphones 121 belonging to each microphone group is increased and the outputs of the microphones 121 are input to a plurality of adjacent signal processing units 211 (arithmetic units) in an overlapping manner, as shown in FIGS. 13 and 14, a configuration without the addition units 212 can be adopted. In FIGS. 13 and 14, parts corresponding to those in FIG. 7 are denoted by the same reference numerals, and their description is omitted as appropriate.
In the example of FIG. 13, signal processing, that is, filtering and the like, is performed on the noise sound from the position P11 using twelve microphones 121, and of the resulting speaker signals, those of four channels are used so that sound is output from four speakers 151.
When the processing is divided among a plurality of signal processing units 211 with such a combination of microphones 121 and speakers 151, the grouping may be performed, for example, as shown in FIG. 14.
That is, in the example shown in FIG. 14, the microphones 121 and the speakers 151 are each divided into four groups, as indicated by the arrows Q41 to Q44.
Specifically, as indicated by the arrow Q41, one microphone group is formed by twelve microphones 121 arranged adjacent to one another and centered on the position at the front right of the user U11. Similarly, as indicated by the arrows Q42 to Q44, microphone groups are formed by twelve microphones 121 arranged adjacent to one another and centered on the rear right, the rear left, and the front left of the user U11, respectively.
In particular, here the grouping is performed so that each microphone group contains twelve mutually adjacent microphones 121, so each microphone 121 belongs to three microphone groups.
For these four microphone groups, four corresponding speaker groups are provided.
That is, as indicated by the arrow Q41, for the microphone group at the front right of the user U11, a speaker group is formed that consists of four speakers 151 arranged adjacent to one another and centered on the position at the front right of the user U11.
Similarly, as indicated by the arrows Q42 to Q44, for the microphone groups at the rear right, the rear left, and the front left of the user U11, speaker groups are formed, each consisting of four speakers 151 arranged adjacent to one another and centered on the position in the corresponding direction at the rear right, the rear left, and the front left of the user U11.
In this example, the speaker groups are formed so that each speaker 151 belongs to only one speaker group, and each speaker group consists of speakers 151 arranged adjacent to one another.
<Configuration example of noise canceling device>
When the microphones 121 and the speakers 151 are grouped as shown in FIG. 14, the noise canceling device is configured, for example, as shown in FIG. 15. In FIG. 15, parts corresponding to those in FIG. 10 are denoted by the same reference numerals, and their description is omitted as appropriate.
The noise canceling device 281 shown in FIG. 15 includes the microphone array 111, the signal processing device 201, and the speaker array 113. The signal processing device 201 has the signal processing units 211-1 to 211-4.
The configuration of the noise canceling device 281 differs from that of the noise canceling device 191 in that the addition units 212 are not provided, and is otherwise the same as that of the noise canceling device 191. However, the noise canceling device 281 and the noise canceling device 191 differ in the input/output relationships between the signal processing units 211 and the microphones 121 and speakers 151.
In this example, the microphones 121-1 to 121-8 and the microphones 121-13 to 121-16 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-1.
Similarly, the microphones 121-1 to 121-12 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-2.
The microphones 121-5 to 121-16 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-3.
The microphones 121-9 to 121-16 and the microphones 121-1 to 121-4 form one group, and the microphone signals of these microphones 121 are supplied to the signal processing unit 211-4.
Therefore, in this example, the output of one microphone 121 is input to two or more, more specifically three, signal processing units 211 determined in advance for that microphone 121 (microphone group). Accordingly, no dummy microphone signals (zero signals) are supplied to the spatial frequency conversion unit 241 of each signal processing unit 211, and the microphone signals of twelve microphones 121 are input.
 Further, speakers 151-1 to 151-4 form one group, and speakers 151-5 to 151-8 also form one group.
 Similarly, speakers 151-9 to 151-12 form one group, and speakers 151-13 to 151-16 form one group.
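 As an illustration only (not part of the embodiment), the following Python sketch enumerates the microphone window supplied to each signal processing unit 211 and a disjoint group of four adjacent speakers per unit, using the indices 1 to 16 given above; the assignment of a particular speaker group to a particular unit is an assumption made for the example. The assertions confirm the two properties stated in the text: each microphone 121 feeds exactly three signal processing units 211, and each speaker 151 belongs to exactly one speaker group.

```python
# Hypothetical sketch of the grouping in FIG. 15 (indices 1..16 as in the text).
# Each signal processing unit 211-k receives a 12-microphone window that wraps
# around the circular array and is shifted by 4 microphones per unit.
NUM_MICS = 16
NUM_UNITS = 4
WINDOW = 12
SHIFT = 4

def mic_group(unit):                       # unit = 0..3 for 211-1..211-4
    start = 12 + SHIFT * unit              # unit 0 then gets mics 13..16, 1..8
    return [(start + i) % NUM_MICS + 1 for i in range(WINDOW)]

def speaker_group(unit):                   # disjoint groups of four adjacent
    return [4 * unit + i + 1 for i in range(4)]   # speakers (unit mapping assumed)

mic_groups = [mic_group(u) for u in range(NUM_UNITS)]
speaker_groups = [speaker_group(u) for u in range(NUM_UNITS)]

# Each microphone signal is branched to exactly three signal processing units,
# while each speaker is driven by exactly one unit, as stated in the text.
for m in range(1, NUM_MICS + 1):
    assert sum(m in g for g in mic_groups) == 3
for s in range(1, NUM_MICS + 1):
    assert sum(s in g for g in speaker_groups) == 1
```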
 Each signal processing unit 211 performs filtering with SISO filters and the like based on the microphone signals, and generates speaker signals for some of the speakers 151 of the speaker array 113, that is, for all the speakers 151 belonging to the speaker group corresponding to the microphone group.
 More specifically, the spatial frequency synthesis unit 243 obtains speaker signals for twelve channels, the same number as the inputs of the spatial frequency conversion unit 241, that is, speaker signals corresponding to each of twelve speakers 151. Of these speaker signals, however, only the speaker signals of four channels, that is, the speaker signals of some of the twelve speakers 151, are actually output to the speakers 151.
 That is, the spatial frequency synthesis unit 243 outputs a speaker signal from each of the four central output terminals among its twelve output terminals to the speaker 151 connected to that terminal. Since no speakers 151 are connected to the remaining eight output terminals, four on either side of those four central terminals, no speaker signals are supplied from these terminals to speakers 151.
 In such a noise canceling device 281, twelve microphone signals are input to each signal processing unit 211, with the outputs of the plurality of microphones 121 constituting the microphone array 111 shifted by four at a time. Therefore, microphone signals actually obtained by picking up sound are input to all twelve input terminals of the spatial frequency conversion unit 241.
 In contrast, the spatial frequency synthesis unit 243 outputs speaker signals from only four of its twelve output terminals, and the remaining eight output terminals are not used. Therefore, part of the output of the spatial frequency synthesis (IDFT) may be omitted, for example.
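 To make that remark concrete, the sketch below (an illustration under assumed conventions, not text from the embodiment) computes only the output channels of a 12-point inverse DFT across the spatial axis that are actually wired to speakers; the choice of channels 4 to 7 (0-based) as the "central" terminals is a placeholder. The result matches the corresponding entries of a full inverse FFT while skipping the eight unused channels.

```python
import numpy as np

def pruned_spatial_idft(spectrum, needed_channels):
    """Compute only the needed output channels of an M-point inverse DFT taken
    across the spatial (channel) axis; the other channels are skipped, mirroring
    the unused output terminals of the spatial frequency synthesis unit 243."""
    m = len(spectrum)                      # m = 12 spatial frequency bins
    out = {}
    for n in needed_channels:              # e.g. the four central terminals
        out[n] = sum(spectrum[k] * np.exp(2j * np.pi * k * n / m)
                     for k in range(m)) / m
    return out

# Example: 12 spatial-frequency bin values for one time/frequency sample, of
# which only output channels 4..7 (assumed indices) drive connected speakers.
bins = np.random.randn(12) + 1j * np.random.randn(12)
partial = pruned_spatial_idft(bins, needed_channels=[4, 5, 6, 7])
full = np.fft.ifft(bins)
assert np.allclose([partial[n] for n in [4, 5, 6, 7]], full[4:8])
```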
 The operations for spatial NC performed by the noise canceling device 281 are exactly equivalent to the operations for spatial NC performed by the noise canceling device 191.
 Therefore, whether to adopt the configuration of the noise canceling device 281 or that of the noise canceling device 191 can be decided in consideration of how easily the outputs of the microphones 121 can be input to the signal processing units 211 and how easily the speaker signals supplied to the speakers 151 can be added (superimposed).
 For example, if analog microphone signals can easily be added or branched, the configuration of the noise canceling device 281 may be adopted. If, for example, speaker signals can easily be added in an analog amplifier, the configuration of the noise canceling device 191 may be adopted.
 In the noise canceling device 281 described above, basically the noise canceling process described with reference to FIG. 6 is performed.
 However, in step S11, the microphone signals obtained by the microphones 121 are supplied to the spatial frequency conversion units 241 of the signal processing units 211.
 Then, in step S12, spatial frequency conversion is performed by the spatial frequency conversion unit 241 of each signal processing unit 211, and the resulting signals are supplied to the filter processing units 242-1 to 242-12 of that signal processing unit 211.
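 Although the specific transform is not restated here, for a uniform array the spatial frequency conversion in step S12 can be read as a DFT taken across the twelve microphone channels independently at each temporal frequency; a generic form under that assumption (the symbols x_m, X_k, and M are introduced only for this illustration) is:

```latex
% Assumed generic spatial DFT; M = 12 channels per signal processing unit 211.
% x_m(\omega): signal of microphone channel m at temporal frequency \omega,
% X_k(\omega): spatial frequency bin k passed to filter processing unit 242-(k+1).
X_k(\omega) \;=\; \sum_{m=0}^{M-1} x_m(\omega)\, e^{-j\,\frac{2\pi k m}{M}},
\qquad k = 0, 1, \ldots, M-1 .
```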
 In step S13, filtering is performed by the filter processing units 242 of each signal processing unit 211, and the resulting speaker signals in the spatial frequency domain are supplied to the spatial frequency synthesis unit 243 of that signal processing unit 211.
 In step S14, spatial frequency synthesis is performed by the spatial frequency synthesis unit 243 of each signal processing unit 211, and the resulting time-domain speaker signals are supplied to the speakers 151 in step S15, whereby spatial NC is realized.
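 Combining steps S12 to S15, a minimal per-block sketch of what one signal processing unit 211 could do under the above assumptions (spatial DFT across channels, one independent SISO FIR filter per spatial bin, spatial inverse DFT, output of four assumed central channels) is shown below; the filter coefficients, block handling, and channel indices are placeholders and are not values taken from the embodiment.

```python
import numpy as np

class SignalProcessingUnitSketch:
    """Hypothetical sketch of one signal processing unit 211:
    spatial DFT -> per-bin SISO filtering -> spatial IDFT -> central outputs."""

    def __init__(self, num_channels=12, filter_len=64, out_channels=(4, 5, 6, 7)):
        self.m = num_channels
        self.out_channels = out_channels
        # One FIR filter per spatial frequency bin (placeholder coefficients).
        self.filters = np.zeros((num_channels, filter_len), dtype=complex)
        # Per-bin filter state so blocks can be processed in a streaming fashion.
        self.state = np.zeros((num_channels, filter_len - 1), dtype=complex)

    def process_block(self, mic_block):
        """mic_block: (num_channels, block_len) time-domain microphone samples."""
        # Step S12: spatial frequency conversion (DFT across the channel axis).
        spatial = np.fft.fft(mic_block, axis=0)
        # Step S13: independent SISO filtering of each spatial frequency bin.
        filtered = np.empty_like(spatial)
        for k in range(self.m):
            padded = np.concatenate([self.state[k], spatial[k]])
            filtered[k] = np.convolve(padded, self.filters[k], mode="valid")
            self.state[k] = padded[-(self.filters.shape[1] - 1):]
        # Step S14: spatial frequency synthesis (inverse DFT across channels).
        speaker_all = np.fft.ifft(filtered, axis=0).real
        # Step S15: only the channels wired to this unit's speaker group are output.
        return speaker_all[list(self.out_channels)]
```

 This sketch is block-by-block only to show where per-bin filter state would live; the actual filter structure, coefficients, and delays of the embodiment are not specified here.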
 In this way, the noise canceling device 281 also reduces the pin count and the amount of computation of each signal processing unit 211 as well as the delay time, and can realize high-performance spatial NC that handles all directions in real time.
<Configuration example of computer>
 The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, a program constituting the software is installed on a computer. Here, the computer includes a computer built into dedicated hardware and, for example, a general-purpose personal computer capable of executing various functions by installing various programs.
 FIG. 16 is a block diagram showing a configuration example of the hardware of a computer that executes the above-described series of processes by means of a program.
 In the computer, a CPU (Central Processing Unit) 501, a ROM (Read Only Memory) 502, and a RAM (Random Access Memory) 503 are connected to one another by a bus 504.
 An input/output interface 505 is further connected to the bus 504. An input unit 506, an output unit 507, a recording unit 508, a communication unit 509, and a drive 510 are connected to the input/output interface 505.
 The input unit 506 includes a keyboard, a mouse, a microphone, an image sensor, and the like. The output unit 507 includes a display, a speaker, and the like. The recording unit 508 includes a hard disk, a non-volatile memory, and the like. The communication unit 509 includes a network interface and the like. The drive 510 drives a removable recording medium 511 such as a magnetic disk, an optical disc, a magneto-optical disc, or a semiconductor memory.
 In the computer configured as described above, the CPU 501 loads, for example, a program recorded in the recording unit 508 into the RAM 503 via the input/output interface 505 and the bus 504 and executes it, whereby the above-described series of processes is performed.
 The program executed by the computer (CPU 501) can be provided by being recorded on the removable recording medium 511 as a package medium or the like, for example. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.
 In the computer, the program can be installed in the recording unit 508 via the input/output interface 505 by mounting the removable recording medium 511 on the drive 510. The program can also be received by the communication unit 509 via a wired or wireless transmission medium and installed in the recording unit 508. In addition, the program can be installed in advance in the ROM 502 or the recording unit 508.
 The program executed by the computer may be a program in which the processes are performed in chronological order in the order described in this specification, or a program in which the processes are performed in parallel or at necessary timings such as when a call is made.
 The embodiments of the present technology are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present technology.
 For example, the present technology can take a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.
 Each step described in the above flowcharts can be executed by one device or shared and executed by a plurality of devices.
 Furthermore, when one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared and executed by a plurality of devices.
 Furthermore, the present technology can also be configured as follows.
(1)
 A signal processing device including one or more signal processing units that perform signal processing in a spatial frequency domain,
 in which the signal processing unit performs the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
(2)
 The signal processing device according to (1), in which the signal processing unit generates a noise cancel signal by performing the signal processing.
(3)
 The signal processing device according to (1) or (2), further including:
 a spatial frequency conversion unit that performs spatial frequency conversion on the plurality of time-domain microphone signals; and
 a spatial frequency synthesis unit that performs spatial frequency synthesis on a signal in the spatial frequency domain obtained by the signal processing,
 in which the signal processing unit performs the signal processing on a signal in the spatial frequency domain obtained by the spatial frequency conversion.
(4)
 The signal processing device according to any one of (1) to (3), in which the signal processing unit performs the signal processing with a plurality of signals based on the plurality of microphone signals obtained by the plurality of microphones as inputs, and outputs a plurality of signals.
(5)
 The signal processing device according to any one of (1) to (4), in which the signal processing unit has a plurality of filter processing units and performs filtering by the filter processing units as the signal processing.
(6)
 The signal processing device according to any one of (1) to (5), including a plurality of the signal processing units,
 in which the signal processing unit performs the signal processing on signals based on the microphone signals obtained by all the microphones belonging to one group when the plurality of microphones is divided into a plurality of groups, and generates speaker signals corresponding to some of a plurality of speakers,
 the signal processing device further including a plurality of addition units corresponding to the respective plurality of speakers,
 in which the addition unit adds the speaker signals, obtained by two or more of the plurality of signal processing units, of the speaker corresponding to the addition unit, and outputs the final speaker signal obtained by the addition to the corresponding speaker.
(7)
 The signal processing device according to (6), in which the plurality of microphones is divided into a predetermined number of groups so that microphones adjacent to one another belong to the same group, and
 for each of the predetermined number of groups, a corresponding one of the predetermined number of signal processing units to which the microphone signals obtained by the microphones belonging to that group are input is defined.
(8)
 The signal processing device according to any one of (1) to (5), including a plurality of the signal processing units,
 in which, for each of the plurality of microphones, the microphone signal obtained by one microphone is input to two or more predetermined signal processing units among the plurality of signal processing units, and
 the signal processing unit performs the signal processing on signals based on the microphone signals input from the plurality of microphones to generate speaker signals corresponding to each of a plurality of speakers, and outputs the speaker signals to some of the plurality of speakers.
(9)
 A signal processing method for a signal processing device including one or more signal processing units that perform signal processing in a spatial frequency domain, the method including:
 performing, by the one or more signal processing units, the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
(10)
 A program that causes a computer controlling a signal processing device including one or more signal processing units that perform signal processing in a spatial frequency domain to execute processing including a step of:
 performing, by the one or more signal processing units, the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
(11)
 A noise canceling device including:
 a plurality of microphones;
 one or more signal processing units that perform signal processing in a spatial frequency domain; and
 a plurality of speakers that output sound based on a noise cancel signal generated by the signal processing,
 in which the signal processing unit generates the noise cancel signal by performing the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by the plurality of microphones.
 111 microphone array, 112 signal processing device, 113 speaker array, 131 signal processing unit, 141 spatial frequency conversion unit, 142-1 to 142-16, 142 filter processing unit, 143 spatial frequency synthesis unit, 211-1 to 211-4, 211 signal processing unit, 212-1 to 212-16, 212 addition unit

Claims (11)

  1.  A signal processing device comprising one or more signal processing units that perform signal processing in a spatial frequency domain,
      wherein the signal processing unit performs the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
  2.  The signal processing device according to claim 1, wherein the signal processing unit generates a noise cancel signal by performing the signal processing.
  3.  The signal processing device according to claim 1, further comprising:
      a spatial frequency conversion unit that performs spatial frequency conversion on the plurality of time-domain microphone signals; and
      a spatial frequency synthesis unit that performs spatial frequency synthesis on a signal in the spatial frequency domain obtained by the signal processing,
      wherein the signal processing unit performs the signal processing on a signal in the spatial frequency domain obtained by the spatial frequency conversion.
  4.  The signal processing device according to claim 1, wherein the signal processing unit performs the signal processing with a plurality of signals based on the plurality of microphone signals obtained by the plurality of microphones as inputs, and outputs a plurality of signals.
  5.  The signal processing device according to claim 1, wherein the signal processing unit has a plurality of filter processing units and performs filtering by the filter processing units as the signal processing.
  6.  The signal processing device according to claim 1, comprising a plurality of the signal processing units,
      wherein the signal processing unit performs the signal processing on signals based on the microphone signals obtained by all the microphones belonging to one group when the plurality of microphones is divided into a plurality of groups, and generates speaker signals corresponding to some of a plurality of speakers,
      the signal processing device further comprising a plurality of addition units corresponding to the respective plurality of speakers,
      wherein the addition unit adds the speaker signals, obtained by two or more of the plurality of signal processing units, of the speaker corresponding to the addition unit, and outputs the final speaker signal obtained by the addition to the corresponding speaker.
  7.  The signal processing device according to claim 6, wherein the plurality of microphones is divided into a predetermined number of groups so that microphones adjacent to one another belong to the same group, and
      for each of the predetermined number of groups, a corresponding one of the predetermined number of signal processing units to which the microphone signals obtained by the microphones belonging to that group are input is defined.
  8.  The signal processing device according to claim 1, comprising a plurality of the signal processing units,
      wherein, for each of the plurality of microphones, the microphone signal obtained by one microphone is input to two or more predetermined signal processing units among the plurality of signal processing units, and
      the signal processing unit performs the signal processing on signals based on the microphone signals input from the plurality of microphones to generate speaker signals corresponding to each of a plurality of speakers, and outputs the speaker signals to some of the plurality of speakers.
  9.  A signal processing method for a signal processing device comprising one or more signal processing units that perform signal processing in a spatial frequency domain, the method comprising:
      performing, by the one or more signal processing units, the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
  10.  A program that causes a computer controlling a signal processing device comprising one or more signal processing units that perform signal processing in a spatial frequency domain to execute processing comprising a step of:
      performing, by the one or more signal processing units, the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by a plurality of microphones.
  11.  A noise canceling device comprising:
      a plurality of microphones;
      one or more signal processing units that perform signal processing in a spatial frequency domain; and
      a plurality of speakers that output sound based on a noise cancel signal generated by the signal processing,
      wherein the signal processing unit generates the noise cancel signal by performing the signal processing on a signal converted into the spatial frequency domain on the basis of microphone signals obtained by sound pickup by the plurality of microphones.
PCT/JP2021/027823 2020-08-11 2021-07-28 Signal processing device and method, noise cancelling device, and program WO2022034795A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
JP2020135544 2020-08-11
JP2020-135544 2020-08-11
JP2020-145742 2020-08-31
JP2020145742 2020-08-31

Publications (1)

Publication Number Publication Date
WO2022034795A1 true WO2022034795A1 (en) 2022-02-17

Family

ID=80247161

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/027823 WO2022034795A1 (en) 2020-08-11 2021-07-28 Signal processing device and method, noise cancelling device, and program

Country Status (1)

Country Link
WO (1) WO2022034795A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017118189A (en) * 2015-12-21 2017-06-29 日本電信電話株式会社 Sound collection signal estimating device, sound collection signal estimating method and program
WO2018163810A1 (en) * 2017-03-07 2018-09-13 ソニー株式会社 Signal processing device and method, and program
WO2019198557A1 (en) * 2018-04-09 2019-10-17 ソニー株式会社 Signal processing device, signal processing method, and signal processing program



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21855874

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21855874

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP