JP2010121975A - Sound-source localizing device - Google Patents

Sound-source localizing device

Info

Publication number
JP2010121975A
JP2010121975A (application number JP2008293831A)
Authority
JP
Japan
Prior art keywords
sound source
number
eigenvalue
means
sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP2008293831A
Other languages
Japanese (ja)
Inventor
Norihiro Hagita
Hiroshi Ishiguro
Carlos Toshinori Ishii
Chatot Olivier
Original Assignee
Advanced Telecommunication Research Institute International
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced Telecommunication Research Institute International
Priority to JP2008293831A
Publication of JP2010121975A
Application status: Pending

Abstract

A sound source localization apparatus capable of stably performing sound source localization using the MUSIC method is provided.
The sound source localization apparatus applies an FFT to each of a plurality of sound source signals every 200 ms, calculates a spatial correlation matrix for each frequency band, and performs eigenvalue decomposition on it. It includes an eigenvector calculation unit that computes eigenvectors and eigenvalues for each of the plurality of frequency bands; first and second average value calculation units 120 and 122 that compute eigenvalue profiles for first and second frequency ranges from those eigenvalues; a KNN classifier 124 that estimates the number of sound sources by the k-nearest neighbor method using the set of eigenvalue profiles as parameters; and a sound source estimation unit that estimates, by the MUSIC method, a number of sound source directions equal to the number of sound sources, based on the number of sound sources estimated by the KNN classifier 124, information on the arrangement of the microphone elements, and the eigenvectors.

Description

  The present invention relates to sound source localization in a real environment, and more particularly to sound source localization using the MUSIC (MUltiple SIgnal Classification) method in a real environment.

  In voice communication between a person and a robot, the microphone attached to the robot is usually located at a distance (1 m or more) from the speaker. Compared to the case where the distance between the microphone and the mouth is only a few centimeters, as with telephone speech, the signal-to-noise ratio (SNR) is low. For this reason, the voices of other people nearby and environmental noise become interfering sounds, making it difficult for the robot to recognize the target speech. Therefore, sound source localization and sound source separation are important for robot applications.

  Various studies on sound source localization have been conducted in the past. However, most of them use only simulation data or laboratory data, and few evaluate real-world data of the kind encountered where a robot actually operates; studies evaluating three-dimensional sound source localization are also rare. Talking and listening while looking at the face of the conversation partner is an important behavior for improving interaction between humans and robots, and for that purpose three-dimensional sound source localization is also important.

  Patent Document 1 is a prior art that assumes a real environment. The technique described in Patent Document 1 uses the MUSIC method, a well-known high-resolution sound source localization method.

In the invention described in Patent Document 1, a microphone array is used; the signals from the microphone array are Fourier transformed, and a current correlation matrix is calculated from the resulting received signal vector and a past correlation matrix. The correlation matrix obtained in this way is subjected to eigenvalue decomposition to obtain the maximum eigenvalue and a noise subspace spanned by the eigenvectors corresponding to the eigenvalues other than the maximum eigenvalue. The direction of the sound source is then estimated by the MUSIC method, with one microphone in the array as a reference, based on the phase differences of the microphone outputs, the noise subspace, and the maximum eigenvalue.
JP 2008-175733 A

  The MUSIC method has the advantage of high resolution, but it has the problem that the number of sound sources must be given when it is used. In the technique described in Patent Document 1, since a single sound source is assumed, this problem does not arise. However, the environment in which a robot actually operates is rarely like that: there are usually several sound sources, and their number is not constant. When the MUSIC method is used, if the number of sound sources is estimated incorrectly, sound source localization will also be incorrect, making it difficult for the robot to interact correctly with humans.

  Furthermore, in the technique described in Patent Document 1, sound source localization is performed two-dimensionally. However, the actual operating environment of a robot is not two-dimensional but three-dimensional. For example, in a shopping arcade, loudspeakers are placed at relatively high positions, and sound is often played from them continuously. Moreover, although the positions of the loudspeakers are fixed, their volume may change. In such an environment it is preferable to localize sound sources three-dimensionally, but the technique described in Patent Document 1 can do so only two-dimensionally.

  In particular, when a robot faces a human, people's heights vary: an adult often speaks from a position higher than the robot, and a child often speaks from a position lower than the robot. From this point of view as well, it is desirable to perform three-dimensional sound source localization.

  Furthermore, since humans move frequently, it is also necessary to track the sound source stably in real time.

  SUMMARY OF THE INVENTION Therefore, an object of the present invention is to provide a sound source localization apparatus that can stably perform sound source localization using the MUSIC method.

  Another object of the present invention is to provide a sound source localization apparatus that can estimate the number of sound sources with high accuracy in order to stably perform sound source localization using the MUSIC method.

  Still another object of the present invention is to provide a sound source localization apparatus that can estimate the number of sound sources with high accuracy and can perform tracking stably in order to stably perform sound source localization using the MUSIC method.

  A sound source localization apparatus according to the present invention includes: conversion means for converting each of the sound source signals of a plurality of channels obtained from the output of a microphone array into frequency components of a plurality of frequency bands at predetermined time intervals; correlation matrix calculation means for obtaining, for each of the plurality of frequency bands of the multi-channel sound source signals obtained by the conversion means, a spatial correlation matrix between the frequency components for each predetermined time interval; eigenvector calculation means for performing eigenvalue decomposition on each of the spatial correlation matrices calculated by the correlation matrix calculation means for each of the plurality of frequency bands at each predetermined time interval, and calculating eigenvectors and eigenvalues for each of the plurality of frequency bands; and eigenvalue profile calculation means for calculating eigenvalue profiles for first and second frequency ranges based on the eigenvalues calculated for each of the plurality of frequency bands at each predetermined time interval by the eigenvector calculation means. Each of the first and second frequency ranges includes one or more of the plurality of frequency bands. The sound source localization apparatus further includes sound source number estimation means for estimating the number of sound sources at each predetermined time interval using as parameters the set of eigenvalue profiles calculated by the eigenvalue profile calculation means for the first and second frequency ranges, and sound source estimation means for estimating, by the MUSIC method, a number of sound source directions equal to the number of sound sources, based on the number of sound sources estimated by the sound source number estimation means, information on the arrangement of the microphone elements belonging to the microphone array, and the eigenvectors calculated by the eigenvector calculation means at each predetermined time interval.

  When performing sound source localization with the MUSIC method, the number of sound sources must be estimated. It is known that the eigenvalues of the correlation matrix calculated as above are related to the number of sound sources, and experiments have shown that high accuracy can be obtained by estimating the number of sound sources from eigenvalue profiles computed separately for the frequency bands belonging to the two frequency ranges described above. Since the number of sound sources can be estimated with high accuracy, sound source localization can be performed stably and with high accuracy by the MUSIC method.

  Preferably, the first frequency range and the second frequency range are continuous with each other.

  More preferably, the first frequency range and the second frequency range do not overlap each other.

  More preferably, the lower limit of the first and second frequency ranges is 1 kHz, and the upper limit is 6 kHz.

  The eigenvalue profile calculation means includes: first eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculation means at each predetermined time interval for the frequency bands belonging to the first frequency range; second eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculation means for the frequency bands belonging to the second frequency range; and means for creating and outputting the eigenvalue profile from the eigenvalue averages calculated for each eigenvalue number by the first and second eigenvalue averaging means.

  Experiments showed that, in creating the eigenvalue profiles, averaging the eigenvalues calculated for the frequency bands belonging to each of the first and second frequency ranges for each eigenvalue number, and using the result as a parameter for predicting the number of sound sources, increases the accuracy of that prediction. Therefore, by calculating the eigenvalue profiles in this way, sound source localization can be performed stably and with high accuracy by the MUSIC method.
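  As a concrete illustration of this averaging, the following sketch shows how the two eigenvalue profiles could be computed with NumPy. It is only an illustrative sketch under assumed conditions, not the patent's implementation: the array shapes, the bin-to-frequency mapping, and the function name eigenvalue_profiles are assumptions made for the example.

```python
import numpy as np

def eigenvalue_profiles(eigvals, bin_freqs, split_hz=3000.0,
                        low_hz=1000.0, high_hz=6000.0):
    """Build the two eigenvalue profiles used as classifier parameters.

    eigvals   : array of shape (n_bins, M) -- eigenvalues per frequency bin,
                sorted in descending order (M = number of microphones).
    bin_freqs : array of shape (n_bins,)   -- center frequency of each bin in Hz.
    Returns the concatenation of the 1-3 kHz and 3-6 kHz average profiles,
    each of length M (one average per eigenvalue number).
    """
    low_band = (bin_freqs >= low_hz) & (bin_freqs < split_hz)
    high_band = (bin_freqs >= split_hz) & (bin_freqs <= high_hz)
    profile_low = eigvals[low_band].mean(axis=0)    # AVG1_3
    profile_high = eigvals[high_band].mean(axis=0)  # AVG3_6
    return np.concatenate([profile_low, profile_high])
```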

  Preferably, the boundary between the first frequency range and the second frequency range is in a range of 2.5 kHz to 4 kHz, for example, 3 kHz or 4 kHz.

  More preferably, the sound source number estimation unit includes a nonlinear estimation unit that has been learned in advance so as to estimate the correct number of sound sources using a set of eigenvalue profiles as a parameter.

  More preferably, the nonlinear estimation means includes learning data storage means for storing a plurality of learning data items, each consisting of a set of eigenvalue profiles and the number of sound sources corresponding to that set, and means for estimating the number of sound sources by the k-nearest neighbor method from the learning data stored in the learning data storage means, using the set of eigenvalue profiles calculated by the eigenvalue profile calculation means as parameters.

  The k-nearest neighbor method requires little computation and is convenient to implement in a robot with limited resources, and experiments have shown that it gives good results. Therefore, by estimating the number of sound sources with the k-nearest neighbor method, sound source localization can be performed stably and accurately with the MUSIC method.

  The number of nearby learning data used in the means for estimating the number of sound sources by the k-nearest neighbor method may be six.

  According to experiments, in the k-nearest neighbor method, the highest accuracy was obtained when k = 6 was selected as the neighborhood.

  More preferably, the sound source localization apparatus further includes sound source tracking means for tracking the sound source azimuth estimated at predetermined time intervals by the sound source estimating means on the time axis.

  In the following description of the embodiments of the present invention, the same reference numerals are assigned to the same components. Their functions are also the same. Therefore, detailed description thereof will not be repeated.

[Overview]
In the present embodiment, a microphone array is arranged near the head of the robot, a plurality of sound sources are localized in real time from the signals obtained from this microphone array, and the sources are tracked. For this purpose, the sound source localization apparatus according to the embodiment described below uses a mechanism that estimates the number of sound sources from information obtained from the sound source signals, based on learning data acquired in advance. By performing sound source localization by the MUSIC method (see the appendix) using the estimated number of sound sources, sound source localization can be performed stably and accurately.

[Constitution]
FIG. 1 shows the microphone array fitted to the chest of the robot 30. Specifically, a microphone base 32 for mounting the microphones around the neck of the robot 30 is created, a plurality of microphones MC1 etc. are fixed to the microphone base 32, and the microphone base 32 is then fixed around the neck of the robot 30.

  FIG. 2 shows a front view, a plan view, and a right side view of the microphone base 32. Referring to FIG. 2, a total of 14 microphones MC1 etc. are used. Nine of them are attached to the front part of the microphone base 32, and the remaining five are attached to the upper surface of the microphone base 32 so as to surround the neck of the robot 30. Of the 14 microphones, the output of the microphone MC1 at the center is used separately from the others in the subsequent processing. In this embodiment, each microphone is an omnidirectional microphone.

  FIG. 3 is a block diagram showing only the sound source localization processing unit 50, the part of the robot of FIG. 1 that relates to sound source localization. Referring to FIG. 3, the sound source localization processing unit 50 includes: an A/D converter 54 that receives 14 analog sound source signals from the microphone array 52 including the microphone MC1 etc., performs analog-to-digital conversion, and outputs 14 digital sound source signals; an eigenvector calculation unit 60 that receives the 14 digital sound source signals output from the A/D converter 54 and, every 200 milliseconds, outputs the correlation matrix required by the MUSIC method together with its eigenvalues and eigenvectors; a sound source estimation unit 62 that, using the eigenvectors and eigenvalues output from the eigenvector calculation unit 60 every 200 milliseconds, estimates a plurality of sound source positions based on the MUSIC method and outputs information representing those positions (directions) (in this embodiment, the two angles φ and θ of three-dimensional polar coordinates; see "MUSIC response" in the appendix); a grouping unit 64 that accumulates the time series of outputs of the sound source estimation unit 62, groups sound sources that exist continuously in time, deletes (filters out) isolated sound sources, and estimates the direction of a sound source that fluctuates in time; and a buffer 66 used by the grouping unit 64 to store the information representing the sound source positions.

  In the present embodiment, the A/D converter 54 performs A/D conversion on the output of each microphone at a standard 16 kHz / 16 bits.

  The eigenvector calculation unit 60 includes: a framing processing unit 80 that frames the 14 digital sound source signals output from the A/D converter 54 with a frame length of 25 milliseconds; an FFT processing unit 82 that applies an FFT (Fast Fourier Transform) to the framed 14-channel sound source signals output from the framing processing unit 80 and converts each frame into a predetermined number of frequency regions (hereinafter each frequency region is called a "bin" and the number of frequency regions the "bin number"); a blocking processing unit 84 that collects the bin values of each channel output from the FFT processing unit 82 every 25 milliseconds into blocks; a correlation matrix calculation unit 86 that calculates and outputs, every 200 milliseconds, a correlation matrix whose elements are the correlations between the bin values output from the blocking processing unit 84; and an eigenvalue decomposition unit 88 that performs eigenvalue decomposition on the correlation matrix output from the correlation matrix calculation unit 86 and outputs eigenvalues 90 and eigenvectors 92 to the sound source estimation unit 62. In the present embodiment, the frequency components of the sound source signal exclude the band below 1 kHz, where spatial resolution is low, and the band above 6 kHz, where spatial aliasing may occur.
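  The processing chain of the eigenvector calculation unit 60 (framing, FFT, blocking, per-bin correlation matrices, eigenvalue decomposition) can be sketched as follows. This is a simplified illustration, not the patented implementation: windowing, frame overlap, and the separate choice of the FFT size (NFFT, discussed later) are omitted, and all names and array shapes are hypothetical.

```python
import numpy as np

def block_eigendecomposition(x, fs=16000, frame_len=0.025, block_len=0.2):
    """Per-block, per-bin spatial correlation matrices and their eigendecompositions.

    x : array of shape (M, n_samples) -- M microphone channels.
    Returns (eigvals, eigvecs, bin_freqs), where eigvals has shape
    (n_blocks, n_bins, M) with eigenvalues sorted in descending order and
    eigvecs has shape (n_blocks, n_bins, M, M).
    """
    M, n_samples = x.shape
    frame = int(frame_len * fs)                            # 25 ms -> 400 samples
    frames_per_block = int(round(block_len / frame_len))   # 200 ms -> 8 frames
    n_frames = n_samples // frame
    n_blocks = n_frames // frames_per_block
    bin_freqs = np.fft.rfftfreq(frame, 1.0 / fs)
    n_bins = bin_freqs.size

    # Frame each channel and take the FFT (windowing/overlap omitted for brevity).
    X = np.fft.rfft(x[:, :n_frames * frame].reshape(M, n_frames, frame), axis=2)

    eigvals = np.empty((n_blocks, n_bins, M))
    eigvecs = np.empty((n_blocks, n_bins, M, M), dtype=complex)
    for b in range(n_blocks):
        Xb = X[:, b * frames_per_block:(b + 1) * frames_per_block, :]
        for k in range(n_bins):
            v = Xb[:, :, k]                          # (M, frames_per_block)
            R = v @ v.conj().T / frames_per_block    # spatial correlation matrix
            w, E = np.linalg.eigh(R)                 # eigh returns ascending order
            eigvals[b, k] = w[::-1]
            eigvecs[b, k] = E[:, ::-1]
    return eigvals, eigvecs, bin_freqs
```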

  The sound source estimation unit 62 includes: a position vector storage unit 100 for storing position vectors that represent the position of each microphone included in the microphone array 52 in a predetermined coordinate system; a sound source number estimation unit 102 for estimating the number of sound sources using the eigenvalues given from the eigenvalue decomposition unit 88 as parameters and outputting the estimated number of sound sources (referred to as "NOS"); and a MUSIC spatial spectrum calculation unit 104 that calculates and outputs the quantity called the MUSIC spatial spectrum in the MUSIC method, using the NOS given from the sound source number estimation unit 102, the microphone position vectors stored in the position vector storage unit 100, and the eigenvectors output from the eigenvalue decomposition unit 88. The fact that the eigenvalues of the correlation matrix obtained for each block are related to the number of sound sources is described, for example, in F. Asano, M. Goto, K. Itou, and H. Asoh, "Real-time sound source localization and separation system and its application on automatic speech recognition," in Eurospeech 2001, Aalborg, Denmark, 2001, pp. 1013-1016.

  In the present embodiment, not only the azimuth angle of each sound source but also its elevation angle is estimated. To that end, a three-dimensional version of the MUSIC algorithm (see the appendix) was implemented. The pair of azimuth and elevation is hereinafter referred to as the sound source direction (DOA). This algorithm does not estimate the distance to the sound source; by estimating only the sound source direction, the processing time can be reduced significantly.

  The sound source estimation unit 62 further includes a MUSIC response calculation unit 106 for calculating and outputting, for each direction, the quantity called the MUSIC response in the MUSIC method based on the MUSIC spatial spectrum calculated by the MUSIC spatial spectrum calculation unit 104, and a peak detection unit 108 that, for each block, detects peaks of the MUSIC response calculated by the MUSIC response calculation unit 106 in descending order of value, up to the number estimated by the sound source number estimation unit 102, and outputs the corresponding directions as information indicating the sound source directions.

  FIG. 4 is a more detailed block diagram of the sound source number estimation unit 102 shown in FIG. 3. Referring to FIG. 4, the sound source number estimation unit 102 includes a first average value calculation unit 120 that calculates, for each eigenvalue number, the average of the eigenvalues obtained from the bins in the 1-3 kHz frequency band (1 kHz or more, 3 kHz or less) among the eigenvalues 90 given for each bin from the eigenvalue decomposition unit 88, and a second average value calculation unit 122 that calculates, for each eigenvalue number, the average of the eigenvalues obtained from the bins in the 3-6 kHz frequency band. The sound source number estimation unit 102 further includes a KNN classifier 124 that, using as parameters the set of eigenvalue averages given from the first average value calculation unit 120 and the set of eigenvalue averages given from the second average value calculation unit 122, estimates the number of sound sources by kNN (k-Nearest Neighbor) classification and outputs it as NOS 128, and a learning data storage unit 126 that stores the learning data used by the KNN classifier 124 for NOS estimation.

  The learning data storage unit 126 stores, as will be described later, a plurality of samples (learning data items), each consisting of the output values of the first average value calculation unit 120 and the second average value calculation unit 122 obtained when the prior learning data were created, together with the number of sound sources at that time. At estimation time, with the parameters (the two sets of eigenvalue averages) given from the first average value calculation unit 120 and the second average value calculation unit 122 as the input vector, the k learning data items closest to the input vector in the vector space are selected; these k items are classified by their number of sound sources, and the number of sound sources into which the largest number of items falls is output as the estimated number of sound sources. In this embodiment, k = 6.
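  A minimal sketch of this k-nearest-neighbor vote, assuming the learning data are stored as plain NumPy arrays (the function and variable names are hypothetical):

```python
import numpy as np

def estimate_nos(profile, train_profiles, train_nos, k=6):
    """Estimate the number of sound sources (NOS) by k-nearest-neighbor voting.

    profile        : 1-D parameter vector for the current block
                     (the concatenated 1-3 kHz and 3-6 kHz eigenvalue averages).
    train_profiles : array of shape (n_samples, len(profile)) -- stored learning data.
    train_nos      : array of shape (n_samples,) -- PNOS label of each sample.
    """
    dists = np.linalg.norm(train_profiles - profile, axis=1)
    nearest = np.argsort(dists)[:k]                  # indices of the k closest samples
    labels, counts = np.unique(train_nos[nearest], return_counts=True)
    return int(labels[np.argmax(counts)])            # majority vote among the k neighbors
```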

  As shown in FIG. 4, the sound source number estimation unit 102 divides the 1-6 kHz frequency band into two, calculates a set of eigenvalue averages for each, and uses them as prediction parameters. The reason will be described later; in short, experiments showed that the estimation accuracy of the sound source direction is highest when the parameters are chosen in this way.

  The peak detection unit 108 obtains the DOAs by detecting local peaks. However, if the directions of two sound sources are close, only one local peak may be found; this happens, for example, when two peaks almost overlap, as with the peak 200 shown in FIG. 11. In such a case, the peak detection unit 108 performs peak detection as follows so that as many peaks as the number of sound sources can be detected.

  First, the largest local peak 200 is detected; this local peak gives one DOA. Next, a two-dimensional Gaussian 202 (see FIG. 11B) is subtracted from the MUSIC response around that local peak. This two-dimensional Gaussian has the standard deviation observed when a single sound source exists and the amplitude of the detected peak. As a result of this operation, if another peak overlaps this peak, only the peak 204 (see FIG. 11C) that could not be detected because of the overlap remains in this region. This operation is repeated as many times as the number of sound sources (NOS).
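  The iterative peak picking can be sketched as follows. The grid representation of the MUSIC response and the Gaussian widths sigma_az and sigma_el (standing in for the single-source spread mentioned above) are assumptions made for the example, not values taken from the patent.

```python
import numpy as np

def detect_doa_peaks(response, az_grid, el_grid, nos,
                     sigma_az=np.deg2rad(10.0), sigma_el=np.deg2rad(5.0)):
    """Iteratively pick NOS peaks from a MUSIC response sampled on an
    (azimuth x elevation) grid, subtracting a 2-D Gaussian at each pick.

    response : array of shape (len(az_grid), len(el_grid)).
    """
    resp = response.copy()
    az_mesh, el_mesh = np.meshgrid(az_grid, el_grid, indexing="ij")
    doas = []
    for _ in range(nos):
        i, j = np.unravel_index(np.argmax(resp), resp.shape)
        peak_az, peak_el, amp = az_grid[i], el_grid[j], resp[i, j]
        doas.append((peak_az, peak_el))
        # Subtract a 2-D Gaussian with the peak's amplitude so that a second
        # source hidden under this peak can surface on the next iteration.
        gauss = amp * np.exp(-((az_mesh - peak_az) ** 2 / (2 * sigma_az ** 2)
                               + (el_mesh - peak_el) ** 2 / (2 * sigma_el ** 2)))
        resp -= gauss
    return doas
```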

  The grouping unit 64 filters and tracks the sound source directions as follows. If the number of sound sources is overestimated, spurious DOAs are inserted. In the method according to the present embodiment, isolated DOAs are removed by a filtering algorithm based on grouping. This algorithm decides whether to group the current DOA candidates with the DOAs detected in the past 10 blocks (corresponding to 2 seconds). Specifically, grouping is performed when the following conditions are satisfied.

  (1) The previous DOA lies inside a "cone" whose tip is the current DOA. Here, the "cone" is a cone in the three-dimensional coordinate space whose coordinates are the sound source direction (θ, φ) obtained by the three-dimensional MUSIC method described in the appendix and the time t. The height direction of the cone is parallel to the time axis. In the present embodiment, the base of the cone is set to span ±30 degrees in azimuth and ±7 degrees in elevation. These values were set heuristically, based on the observation that a person is more likely to move horizontally (change in azimuth) than vertically (change in elevation).

  (2) The distance between the current DOA and the trend line of the group to which the previous DOA belongs is smaller than a certain threshold. Here, the "trend line" is the straight line obtained by extrapolating the regression line fitted to the DOA sequence belonging to the group up to the current time.

  The first condition allows a DOA to be grouped only into a group whose sound source direction (θ, φ) is close to it, and the second condition prevents it from being grouped into a group moving in a different direction. Experiments have shown that sound sources can be tracked well by grouping DOAs according to these criteria. In particular, even when two different sound sources approach each other until their directions overlap, cross, and then move apart, their directions can be detected correctly.
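  A rough sketch of the two grouping tests is given below. The linear widening of the cone toward past blocks, the trend-line distance threshold, and the (block, azimuth, elevation) tuple layout are interpretations and placeholders made for this example, not values taken from the patent.

```python
import numpy as np

AZ_LIMIT = np.deg2rad(30.0)          # half-width of the cone base in azimuth
EL_LIMIT = np.deg2rad(7.0)           # half-width of the cone base in elevation
TREND_THRESHOLD = np.deg2rad(10.0)   # hypothetical trend-line distance threshold

def in_cone(curr, prev, n_blocks=10):
    """Condition (1): previous DOA lies inside the cone whose tip is the current DOA.

    curr, prev : (block_index, azimuth, elevation) tuples.
    The cone is assumed to widen linearly from the tip back over n_blocks blocks.
    """
    dt = curr[0] - prev[0]
    if not 0 < dt <= n_blocks:
        return False
    scale = dt / n_blocks
    return (abs(curr[1] - prev[1]) <= AZ_LIMIT * scale and
            abs(curr[2] - prev[2]) <= EL_LIMIT * scale)

def near_trend_line(curr, group):
    """Condition (2): current DOA is close to the group's extrapolated regression line."""
    if len(group) < 2:
        return True                              # too short to fit a trend line
    t = np.array([d[0] for d in group], dtype=float)
    az_fit = np.polyfit(t, [d[1] for d in group], 1)
    el_fit = np.polyfit(t, [d[2] for d in group], 1)
    pred_az = np.polyval(az_fit, curr[0])        # extrapolate to the current block
    pred_el = np.polyval(el_fit, curr[0])
    dist = np.hypot(curr[1] - pred_az, curr[2] - pred_el)
    return dist < TREND_THRESHOLD

def belongs_to_group(curr, group):
    """A current DOA joins a group when both grouping conditions hold."""
    return in_cone(curr, group[-1]) and near_trend_line(curr, group)
```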

  FIG. 5 is a block diagram of the learning data creation processing unit 140 that creates the learning data stored in advance in the learning data storage unit 126 shown in FIG. 4. In this embodiment, the learning data creation processing unit 140 is also provided in the robot 30; however, it goes without saying that the learning data creation processing unit 140 may instead be provided outside the robot 30 to create the learning data, with only the created learning data incorporated into the robot 30.

  Referring to FIG. 5, the learning data creation processing unit 140 includes the same microphone array 52 and A/D converter 54 as those shown in FIG. 3, and an eigenvalue calculation unit 160 that receives the digital sound source signals output from the A/D converter 54 and, by the same processing as the eigenvector calculation unit 60 shown in FIG. 3, calculates and outputs eigenvalues for each frequency band of the sound source signals every 200 milliseconds. The eigenvalue calculation unit 160 includes a framing processing unit 180, an FFT processing unit 182, a blocking processing unit 184, a correlation matrix calculation unit 186, and an eigenvalue decomposition unit 188 that perform the same processing as the framing processing unit 80, the FFT processing unit 82, the blocking processing unit 84, the correlation matrix calculation unit 86, and the eigenvalue decomposition unit 88 shown in FIG. 3. However, the eigenvalue decomposition unit 188 outputs only eigenvalues.

  In addition, a first binary masking processing unit 162, which receives the output 194 of the framed frequency components output by the FFT processing unit 182 every 25 milliseconds and performs cross-channel spectral binary masking between the channels of the sound source signals (the details will be described later), and a second binary masking processing unit 164, which performs binary masking between the sound source signals output from the first binary masking processing unit 162 after the cross-channel spectral binary masking and the frequency components obtained from the central microphone MC1 (the output 192 of the FFT processing unit 182), are further included.

  The cross-channel binary masking performed by the first binary masking processing unit 162 is the following process. The signals of two channels are converted to the frequency domain, and for the two converted signals the values of the individual frequency components are compared frame by frame. The larger (stronger) value is retained, and the smaller (weaker) value is set to 0. Both signals are then returned to the time domain. This processing is performed so that, when sound that is correlated between microphones is picked up, only the signal from the microphone closer to the sound source remains. By this processing, sound leakage between channels can be suppressed, and a more reliable reference signal can be obtained.

  The binary masking performed by the second binary masking processing unit 164 is the following process. An arbitrary signal (the signal to be processed) is taken, and both that signal and the signal obtained from the microphone MC1 are converted to the frequency domain. For these two signals, the values of the individual frequency components are compared frame by frame. If the value of the signal to be processed is larger (stronger), it is kept; if it is smaller, that frequency component is set to 0. The signal to be processed is then returned to the time domain. This process removes environmental sounds from each sound source signal; for that reason the frequency components from the central microphone MC1 are used as the reference.
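  Both masking stages reduce to a per-frame, per-bin comparison of magnitudes. A minimal sketch, assuming the spectra are available as NumPy arrays (the function names are hypothetical):

```python
import numpy as np

def cross_channel_binary_mask(A, B):
    """Cross-channel binary masking between two channels (first masking stage).

    A, B : complex spectra of the same frame, shape (n_bins,).
    For each bin, the channel with the larger magnitude keeps its value and
    the other channel is zeroed, so only the microphone closer to the source
    retains the correlated sound.
    """
    keep_a = np.abs(A) >= np.abs(B)
    return np.where(keep_a, A, 0), np.where(keep_a, 0, B)

def reference_binary_mask(X, ref):
    """Binary masking against the central microphone MC1 (second masking stage).

    X   : complex spectrum of the signal to be processed, shape (n_bins,).
    ref : complex spectrum of the reference microphone for the same frame.
    Bins where the processed signal is weaker than the reference are zeroed,
    which suppresses environmental sound common to all channels.
    """
    return np.where(np.abs(X) > np.abs(ref), X, 0)
```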

  The learning data creation processing unit 140 further includes: a power calculation unit 168 for calculating, for each 25-millisecond frame, the power of the sound source signal of each channel output from the second binary masking processing unit 164; a sound source number determination unit 172 that compares the time series (power trajectory) of the power values of each channel calculated by the power calculation unit 168 with a threshold, judges a sound source signal exceeding the threshold to be active, and outputs the number of active sound source signals as the number of sound sources of that frame; and a threshold storage unit 170 that stores in advance the threshold used for the determination by the sound source number determination unit 172. The number of sound sources output from the sound source number determination unit 172 is referred to as PNOS, to distinguish it from the number of sound sources (NOS) output from the sound source number estimation unit 102 shown in FIG. 4.
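  The reference count PNOS then follows from thresholding per-channel frame power, roughly as in this sketch (the threshold value and the data layout are assumptions):

```python
import numpy as np

def count_active_sources(masked_frames, threshold):
    """Reference source count (PNOS) for one 25 ms frame.

    masked_frames : array of shape (n_channels, n_bins) -- per-channel spectra
                    after the two binary masking stages.
    threshold     : power value above which a channel is considered active.
    """
    power = np.sum(np.abs(masked_frames) ** 2, axis=1)  # per-channel frame power
    return int(np.count_nonzero(power > threshold))
```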

  The learning data creation processing unit 140 further includes a learning data storage unit 174 that stores, as learning data in the learning data storage unit 126, pairs consisting of the eigenvalue set output from the eigenvalue decomposition unit 188 every 200 milliseconds and the PNOS value given from the sound source number determination unit 172. Although not shown in FIG. 5, the learning data storage unit 174, in the same manner as the sound source number estimation unit 102 shown in FIG. 4, averages the eigenvalues given from the eigenvalue decomposition unit 188 separately for the 1-3 kHz band and the 3-6 kHz band, converts them into two sets of eigenvalue averages, and then stores them in the learning data storage unit 126 together with the PNOS value at that time. Since the eigenvalues from the eigenvalue decomposition unit 188 are obtained every 200 milliseconds while PNOS is obtained every 25 milliseconds, PNOS is converted into 200-millisecond blocks by averaging and rounding.

  The reason why the eigenvalues are averaged separately over the 1-3 kHz band and the 3-6 kHz band in the sound source number estimation unit 102 of FIG. 4 is explained below.

  The inventors conducted the following experiment to examine how the eigenvalues are affected by environmental changes, and arranged the eigenvalues by PNOS. A set of eigenvalues is obtained for each bin. Here, in order to obtain one representative eigenvalue set per block, the eigenvalues averaged over a specific frequency band are treated as the eigenvalue profile of the block.

  FIGS. 6 to 8 show the eigenvalue profiles for each block, arranged by PNOS, in three different environments (OFC, UCW1, and UCW2). These are the environments in which the audio for the experiments was recorded, and are specifically as follows.

  Data were recorded with the microphone array in two different types of environment. The first is an office environment (OFC), where the main noise sources are indoor air conditioning and the robot's internal noise. The second is the corridor of an open-air shopping mall (UCW), where a demonstration experiment with the robot is currently being conducted. The main noise source in UCW is pop/rock music played from loudspeakers installed on the ceiling. Recordings were made at various locations and in various directions in the corridor.

  Here, the results of four of the recordings are shown. The first is the office environment (OFC), and the remaining three are shopping mall environments (UCW1-3). Table 1 gives the details of the sound sources in each recording.


Referring to FIGS. 6 to 8, it can be observed that the number of sound sources (NOS) is related to the overall offset and shape of the eigenvalue profile. In the shape of an ideal eigenvalue profile, the first N eigenvalues, corresponding to the N directional sound sources, show strong power, and the remaining M−N eigenvalues, corresponding to omnidirectional sources, show small power. However, in the actual shapes shown in FIGS. 6 to 8, the boundary between the directional and omnidirectional components is unclear, and the omnidirectional part is not flat but shows a gentle slope.

  It is also observed that eigenvalue profiles partially overlap even between different PNOS values. For example, in the OFC environment (FIG. 6), the profiles for PNOS = 1 and PNOS = 2 overlap considerably. It is therefore expected to be difficult for a classifier to estimate the number of sound sources accurately.

  Furthermore, it can be seen that changes in the environment strongly affect the shape of the eigenvalue profile; differences appear in spread and slope. For example, comparing the PNOS = 0 eigenvalue profiles of OFC and UCW1 (NOS = 0 in FIGS. 6 and 7), the difference is clear: in UCW1, because of the background music, the values are larger than in OFC and the variation is greater. This indicates that environmental changes need to be taken into account in the classifier.

  The influence of approaching the source of the environmental music on the eigenvalue profile is also visible. The PNOS = 0 profile of UCW2 is similar to the PNOS = 1 profile of UCW1. This reflects the fact that the environmental music becomes a new directional sound source when the robot approaches its loudspeaker, and an omnidirectional source when the robot is far away. When the environmental music is a directional sound source, its direction can be obtained, which is useful for the subsequent sound source separation of the target sound.

  Finally, eigenvalue profiles averaged over different frequency bands were analyzed. FIGS. 6 to 8 show the eigenvalue profiles averaged over the frequency bins of three different frequency bands: 1-6 kHz (AVG1_6), 1-3 kHz (AVG1_3), and 3-6 kHz (AVG3_6). Comparing the three columns of FIGS. 6 to 8 for profiles with NOS > 0, the difference between the first and sixth eigenvalues is larger in AVG3_6 (right column). From this result, AVG3_6 is considered to have higher discriminability than the wide-band AVG1_6. However, AVG3_6 may fail to detect speech segments in which components above 3 kHz are weak, such as /u/ and /o/. For these reasons, the two sets of eigenvalues obtained from the divided frequency bands were used for the classifier: AVG1_3 provides relatively high discriminability in segments where components above 3 kHz are weak, and AVG3_6 provides sufficient discriminability in segments where components above 3 kHz are strong.

  The kNN (k-Nearest Neighbors) algorithm was selected as the classification algorithm, because kNN requires little computation and can handle non-linearity. Not only kNN but any machine learning method that can handle non-linearity, for example an SVM (Support Vector Machine) or a neural network (NN), could be used.

  The eigenvalues obtained from the correlation matrix of the observed signals are used as the input parameters of the classification algorithm. Various classifiers were trained and evaluated using as input either the set of averages (AVG) or the set of maxima (MAX) of the eigenvalues over the frequency bins of various frequency bands. Pairs of eigenvalue sets obtained by splitting the frequency band in two were also evaluated.

  The performance of the kNN classifier was evaluated by 10-fold cross-validation for various k (number of nearest neighbors). FIG. 9 shows the degree of agreement (NOS accuracy) between the estimated NOS and the reference PNOS for the various classifiers; it shows the results for k = 6, at which the NOS accuracy was highest.

  FIG. 9 shows that the average values (AVG) give higher performance than the maximum eigenvalues (MAX). The highest performance was obtained when the averages over [1-3] (1 kHz-3 kHz) and [3-6] (3 kHz-6 kHz) were used. Also, when [1-4] and [4-6] are used, the accuracy is lower than with the combination of [1-3] and [3-6] but still good compared with [1-6], which lumps the whole band together. This indicates that it is effective to use two sets of eigenvalues obtained by splitting the frequency band.

  FIG. 9 further shows that when [1-2] and [2-6] are used, the accuracy is almost the same as with [1-6]. Therefore, when the frequency band is divided into two, the split frequency at which the accuracy becomes higher than with [1-6] lies in the range of about 2.5 kHz to 4 kHz.

  The influence of the FFT bin number (NFFT) on the detection of the sound source direction (DOA) was also examined. FIG. 10 shows the DOA performance for various values of NFFT (32, 64, 128, 256, 512). NFFT = 128 gave the highest performance in estimating the sound source direction. However, as can be seen from FIG. 10, no significant degradation is observed even with smaller NFFT, and a smaller NFFT greatly reduces the amount of computation. Therefore, NFFT = 64 is adopted in the above embodiment.

[Operation]
The sound source localization processing unit 50 according to the above embodiment operates as follows. It is assumed that the microphone array is attached to the robot 30 using the microphone base 32 as shown in FIGS. 1 and 2.

  First, the learning data creation processing unit 140 operates as follows when creating learning data. The microphone array 52 converts the sound from the sound sources into 14 analog electrical signals and supplies them to the A/D converter 54. The A/D converter 54 converts these signals into 16-bit digital signals at 16 kHz and provides the 14 digital signals to the framing processing unit 180.

  The framing processing unit 180 frames the digital sound source signals of these channels with a frame length of 25 milliseconds and gives them to the FFT processing unit 182. The FFT processing unit 182 applies an FFT to the digital sound source signal of each frame of each channel, converts it into the output 194 of frequency components, and gives it to the blocking processing unit 184 and the first binary masking processing unit 162. Of the output 194 of the FFT processing unit 182, the output 192 obtained from the sound source signal of the microphone MC1 is also supplied to the second binary masking processing unit 164.

  The blocking processing unit 184 collects the signals output every 25 milliseconds from the FFT processing unit 182 into blocks of 200 milliseconds and gives them to the correlation matrix calculation unit 186. The correlation matrix calculation unit 186 calculates the inter-channel correlation matrix for each of these blocks and provides it to the eigenvalue decomposition unit 188. The eigenvalue decomposition unit 188 performs eigenvalue decomposition on the correlation matrix calculated by the correlation matrix calculation unit 186 and supplies the result to the learning data storage unit 174.

  Meanwhile, the first binary masking processing unit 162 applies cross-channel spectral binary masking to the output 194 of the FFT processing unit 182 and gives the result to the second binary masking processing unit 164. The second binary masking processing unit 164 applies binary masking to the frequency-domain values of each channel output from the first binary masking processing unit 162, using as the reference the post-FFT frequency components obtained from the sound source signal of the microphone MC1 (output 192 of the FFT processing unit 182), and returns the signal of each channel to the time domain and outputs it.

  The power calculation unit 168 calculates the power of the sound source signal of each channel output from the second binary masking processing unit 164 every 25 milliseconds and supplies it to the sound source number determination unit 172. The sound source number determination unit 172 tracks the power trajectory of each channel given from the power calculation unit 168; if a channel is found whose power exceeds the threshold stored in the threshold storage unit 170, the sound source signal of that channel is judged to be active, and the number of sound source signals active in the block is output to the learning data storage unit 174 as the PNOS of that block.

  The learning data storage unit 174 calculates a first set of eigenvalue averages by averaging, for each eigenvalue number, the 1-3 kHz eigenvalues among the eigenvalues given from the eigenvalue decomposition unit 188. Similarly, the learning data storage unit 174 calculates a second set of eigenvalue averages by averaging the 3-6 kHz eigenvalues for each eigenvalue number. The learning data storage unit 174 stores these two sets of eigenvalue averages as parameters, together with the PNOS, in the learning data storage unit 126 as one learning data item.

  When the learning data is stored in the learning data storage unit 126 in this way, the sound source localization processing unit 50 can be operated by using the learning data storage unit 126 in the sound source number estimation unit 102 of FIG.

  At run time, the sound source localization processing unit 50 operates as follows. The microphone array 52, the A/D converter 54, the framing processing unit 80, the FFT processing unit 82, the blocking processing unit 84, the correlation matrix calculation unit 86, and the eigenvalue decomposition unit 88 perform the same operations as the framing processing unit 180, the FFT processing unit 182, the blocking processing unit 184, the correlation matrix calculation unit 186, and the eigenvalue decomposition unit 188 of the learning data creation processing unit 140 at learning data creation time. However, the eigenvalue decomposition unit 88 calculates not only the eigenvalues but also the eigenvectors 92 corresponding to the respective eigenvalues, and gives them to the MUSIC spatial spectrum calculation unit 104 every 200 milliseconds. The eigenvalues 90 calculated by the eigenvalue decomposition unit 88 are given to the sound source number estimation unit 102. It is assumed that the position vector storage unit 100 stores in advance position vectors corresponding to the arrangement of the microphones in the microphone array 52.

  The sound source number estimation unit 102 operates as follows. The first average value calculation unit 120 averages, for each eigenvalue number, the eigenvalues in the 1-3 kHz region among the eigenvalues 90 given every 200 milliseconds, and gives them to the KNN classifier 124 as the first set of eigenvalue averages. The second average value calculation unit 122 averages, for each eigenvalue number, the eigenvalues in the 3-6 kHz region among the eigenvalues 90 given every 200 milliseconds, and gives the result to the KNN classifier 124 as the second set of eigenvalue averages.

  The KNN classifier 124, using as parameters the first and second sets of eigenvalue averages given from the first average value calculation unit 120 and the second average value calculation unit 122, estimates the NOS 128 by kNN classification based on the learning data stored in the learning data storage unit 126 and provides it to the MUSIC spatial spectrum calculation unit 104.

  The processing from the MUSIC spatial spectrum calculation unit 104 onward is the three-dimensional version of the ordinary MUSIC method. First, the MUSIC spatial spectrum calculation unit 104 calculates the MUSIC spatial spectrum every 200 milliseconds based on the position vectors stored in the position vector storage unit 100, the eigenvectors given from the eigenvalue decomposition unit 88, and the NOS given from the sound source number estimation unit 102, and gives it to the MUSIC response calculation unit 106. The MUSIC response calculation unit 106 calculates the MUSIC response every 200 milliseconds based on the MUSIC spatial spectrum and provides it to the peak detection unit 108. The peak detection unit 108 searches the MUSIC response, detects, in descending order, the same number of peaks as the NOS given from the sound source number estimation unit 102, and gives them to the grouping unit 64. The grouping unit 64 is thus given the same number of sound source directions as the NOS every 200 milliseconds.

  For the first 10 blocks, the grouping unit 64 stores the sound source directions calculated for each block in the buffer 66 and, at the same time, groups directions that appear to belong to the same sound source across blocks; the grouping method has already been described. From the 11th block onward, the grouping unit 64 stores the sound source directions of each block in the buffer 66 in first-in first-out fashion and removes any sound source direction that has not been grouped with other sound source directions after 10 blocks have elapsed. Then, when there are groups that continue for two or more blocks, the grouping unit 64 regards each such group as representing one sound source and outputs its sound source direction for each block.

  With the above operation, the sound source localization processing unit 50 can continuously perform localization and tracking of a plurality of sound sources.

  According to the sound source localization processing unit 50 of the above embodiment, the eigenvalues of the inter-channel correlation matrix, which are known to be related to the number of sound sources, are averaged for each eigenvalue number separately over the 1-3 kHz and 3-6 kHz frequency regions for each block and used as the parameters for estimating the number of sound sources. As the experimental results make clear, using such parameters allows the number of sound sources to be predicted with high accuracy, and sound source localization by the MUSIC method can be performed stably and with high accuracy.

  Furthermore, since the three-dimensional MUSIC method is used in the present embodiment, the sound source direction can be estimated including not only the azimuth angle but also the elevation angle within a certain range. Therefore, even in a real environment where speech arrives from various directions, a robot or the like can correctly localize the sound sources and act appropriately. When the robot interacts with a human, it can be expected to act appropriately while looking at the other party's face, making the interaction between the robot and the human smoother.

  The embodiment disclosed herein is merely an example, and the present invention is not limited to the above-described embodiment. The scope of the present invention is indicated by each claim, with the description of the detailed description of the invention taken into account, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

[Appendix: MUSIC method]
The Fourier transforms X_m(k, t) of the M microphone inputs are modeled as shown in Equation (1).


Here, the vector s(k, t) consists of the N sound source spectra S_n(k, t): s(k, t) = [S_1(k, t), ..., S_N(k, t)]^T. The indices k and t denote frequency and time frame, respectively. The vector n(k, t) denotes background noise. The matrix A_k is the transfer function matrix, whose (m, n) element is the transfer function of the direct path from the n-th sound source to the m-th microphone. The n-th column vector of A_k is called the position vector (steering vector) of the n-th sound source.

First, the spatial correlation matrix R_k defined by Equation (2) is obtained, and the eigenvalue diagonal matrix Λ_k and the eigenvector matrix E_k are obtained by the eigenvalue decomposition of R_k shown in Equation (3).


The eigenvector matrix can be partitioned as E_k = [E_ks | E_kn], where E_ks and E_kn denote the eigenvectors corresponding to the dominant N eigenvalues and the remaining eigenvectors, respectively.

  The MUSIC spatial spectrum is obtained by Equations (4) and (5), where r is the distance and θ and φ are the azimuth and elevation angles, respectively. Equation (5) is the normalized position vector at the scanned point (r, θ, φ).


The spatial spectrum (referred to herein as a “MUSIC response”) is an averaged MUSIC spatial spectrum as shown in Equation (6).


In Equation (6), k_L and k_H are the indices of the lower and upper boundaries of the frequency band, respectively, and K = k_H − k_L + 1. The directions of the sound sources are obtained from the N peaks of the MUSIC response.
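Since the patent's equations are not reproduced above, the following sketch implements the standard narrow-band MUSIC spatial spectrum and its average over the 1-6 kHz bins (the MUSIC response), in the spirit of Equations (2) to (6). The free-field steering-vector model, the scan grid, and all names are assumptions made for illustration; the exact normalization used in the patent may differ.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steering_vector(mic_pos, r, theta, phi, freq):
    """Normalized position (steering) vector for a scan point given in
    spherical coordinates (r, theta=azimuth, phi=elevation), free-field model."""
    src = r * np.array([np.cos(phi) * np.cos(theta),
                        np.cos(phi) * np.sin(theta),
                        np.sin(phi)])
    dists = np.linalg.norm(mic_pos - src, axis=1)          # distance to each mic
    a = np.exp(-2j * np.pi * freq * dists / SPEED_OF_SOUND)
    return a / np.linalg.norm(a)

def music_response(eigvecs, bin_freqs, mic_pos, nos, scan_grid,
                   r=1.0, f_lo=1000.0, f_hi=6000.0):
    """MUSIC response averaged over the bins between f_lo and f_hi.

    eigvecs   : (n_bins, M, M) eigenvectors of one block's per-bin correlation
                matrices, sorted by descending eigenvalue.
    nos       : estimated number of sound sources for this block.
    scan_grid : list of (theta, phi) pairs to scan; r is a fixed scan radius
                since the distance itself is not estimated.
    """
    band = [k for k, f in enumerate(bin_freqs) if f_lo <= f <= f_hi]
    response = np.zeros(len(scan_grid))
    for k in band:
        En = eigvecs[k][:, nos:]                            # noise subspace E_kn
        for i, (theta, phi) in enumerate(scan_grid):
            a = steering_vector(mic_pos, r, theta, phi, bin_freqs[k])
            denom = np.abs(a.conj() @ En @ En.conj().T @ a)
            response[i] += np.abs(a.conj() @ a) / denom     # per-bin spatial spectrum
    return response / len(band)
```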

FIG. 1 shows the microphone base 32 attached to the robot 30 having the sound source localization processing unit 50 according to one embodiment of the present invention.
FIG. 2 is a three-view drawing of the microphone base 32.
FIG. 3 is a block diagram of the sound source localization processing unit 50.
FIG. 4 is a block diagram of the sound source number estimation unit 102.
FIG. 5 is a block diagram of the learning data creation processing unit 140.
FIG. 6 shows the eigenvalue profiles for each block, arranged by PNOS, in the recording environment OFC.
FIG. 7 shows the eigenvalue profiles for each block, arranged by PNOS, in the recording environment UCW1.
FIG. 8 shows the eigenvalue profiles for each block, arranged by PNOS, in the recording environment UCW2.
FIG. 9 is a graph showing the experimental accuracy of sound source number estimation by the kNN classification method for different ways of calculating the eigenvalue profile.
FIG. 10 is a graph showing the per-block accuracy of sound source number estimation for different values of the number of neighbors k in the kNN classification method.
FIG. 11 is a figure for explaining the peak detection method used by the peak detection unit.

Explanation of symbols

30 robot; 32 microphone base; 50 sound source localization processing unit; 52 microphone array; 60 eigenvector calculation unit; 62 sound source estimation unit; 64 grouping unit; 86, 186 correlation matrix calculation unit; 88, 188 eigenvalue decomposition unit; 102 sound source number estimation unit; 104 MUSIC spatial spectrum calculation unit; 106 MUSIC response calculation unit; 108 peak detection unit; 120 first average value calculation unit; 122 second average value calculation unit; 124 KNN classifier; 126 learning data storage unit; 162 first binary masking processing unit; 164 second binary masking processing unit; 168 power calculation unit; 172 sound source number determination unit; 174 learning data storage unit

Claims (10)

  1. Conversion means for converting each of the sound source signals of a plurality of channels obtained from the output of the microphone array into frequency components of a plurality of frequency bands at predetermined time intervals;
    Correlation matrix calculating means for obtaining a spatial correlation matrix between frequency components for each of the predetermined time intervals for each of the plurality of frequency bands of the sound source signals of the plurality of channels obtained by the converting means;
    The correlation matrix calculation means performs eigenvalue decomposition on each of the spatial correlation matrices calculated for each of the plurality of frequency bands at each predetermined time interval, and calculates eigenvectors and eigenvalues for each of the plurality of frequency bands. Eigenvector calculation means for performing,
    eigenvalue profile calculation means for calculating eigenvalue profiles for first and second frequency ranges based on the eigenvalues calculated for each of the plurality of frequency bands by the eigenvector calculation means at each predetermined time interval, wherein each of the first and second frequency ranges includes one or more of the plurality of frequency bands;
    further,
    A sound source number estimating means for estimating the number of sound sources at each predetermined time interval, using as a parameter the set of eigen value profiles calculated by the eigen value profile calculating means for the first and second frequency ranges;
    and sound source estimation means for estimating, by the MUSIC method, a number of sound source directions equal to the number of sound sources, based on the number of sound sources estimated by the sound source number estimating means, information on the arrangement of microphone elements belonging to the microphone array, and the eigenvectors calculated by the eigenvector calculating means at each of the predetermined time intervals;
    a sound source localization apparatus comprising the above.
  2. The sound source localization apparatus according to claim 1, wherein the first frequency range and the second frequency range are continuous with each other.
  3. The sound source localization apparatus according to claim 1, wherein the first frequency range and the second frequency range do not overlap each other.
  4. The sound source localization apparatus according to claim 3, wherein a lower limit of the first and second frequency ranges is 1 kHz, and an upper limit is 6 kHz.
  5. The sound source localization apparatus as described above, wherein the eigenvalue profile calculation means includes:
    first eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculating means for the frequency bands belonging to the first frequency range;
    second eigenvalue averaging means for averaging, for each eigenvalue number, the eigenvalues calculated by the eigenvector calculating means for the frequency bands belonging to the second frequency range; and
    means for creating and outputting the eigenvalue profile from the averages of the eigenvalues calculated for each eigenvalue number by the first and second eigenvalue averaging means.
  6. The sound source localization apparatus according to claim 5, wherein a boundary between the first frequency range and the second frequency range is in a range of 2.5 kHz to 4 kHz.
  7. The sound source localization apparatus according to any one of claims 1 to 6, wherein the sound source number estimation means includes nonlinear estimation means trained in advance to estimate the correct number of sound sources using the set of eigenvalue profiles as a parameter.
  8. The sound source localization apparatus according to claim 7, wherein the nonlinear estimation means includes:
    learning data storage means for storing a plurality of learning data items, each comprising a set of eigenvalue profiles serving as a parameter and the number of sound sources corresponding to that set; and
    means for estimating the number of sound sources by the k-nearest neighbor method, using as a parameter the set of eigenvalue profiles calculated by the eigenvalue profile calculation means and using the learning data stored in the learning data storage means.
  9. The sound source localization apparatus according to claim 8, wherein the number of neighboring learning data items used by the means for estimating the number of sound sources by the k-nearest neighbor method is six.
  10. The sound source localization apparatus according to claim 1, further comprising sound source tracking means for tracking, on a time axis, the sound source directions estimated at each predetermined time interval by the sound source estimation means.
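The following sketches are informal, non-normative Python (NumPy) illustrations of the signal path recited in the claims; all function names, frame lengths, and array geometries in them are assumptions for illustration only. First, for the conversion, correlation matrix calculating, and eigenvector calculation means of claim 1: each channel is framed and FFT-transformed, a spatial correlation matrix is averaged over the frames of one analysis interval for every frequency bin, and each matrix is eigendecomposed with the eigenvalues sorted in descending order.

```python
import numpy as np

def stft_frames(x, frame_len=512, n_fft=512):
    """Conversion means: split each channel into non-overlapping frames and apply the FFT.
    x: (n_channels, n_samples) multichannel time-domain signal.
    Returns X: (n_frames, n_bins, n_channels) complex spectra."""
    n_ch, n_samp = x.shape
    n_frames = n_samp // frame_len
    frames = x[:, :n_frames * frame_len].reshape(n_ch, n_frames, frame_len)
    return np.fft.rfft(frames, n=n_fft, axis=-1).transpose(1, 2, 0)

def spatial_correlation(X):
    """Correlation matrix calculating means: R(f) = mean over frames of x(f) x(f)^H.
    X: (n_frames, n_bins, n_channels) -> R: (n_bins, n_channels, n_channels)."""
    return np.einsum('tfc,tfd->fcd', X, X.conj()) / X.shape[0]

def eigendecompose_bands(R):
    """Eigenvector calculation means: eigenvalue decomposition of each band's
    Hermitian correlation matrix, with eigenvalues sorted in descending order."""
    eigvals, eigvecs = np.linalg.eigh(R)              # ascending order per band
    order = np.argsort(eigvals, axis=-1)[:, ::-1]     # flip to descending
    eigvals = np.take_along_axis(eigvals, order, axis=-1)
    eigvecs = np.take_along_axis(eigvecs, order[:, None, :], axis=-1)
    return eigvals, eigvecs
```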
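Next, for the eigenvalue profile calculation means of claim 5: each profile is the per-eigenvalue-number average of the eigenvalues of all frequency bands inside one frequency range, and the two profiles together form the parameter passed to the classifier. The 1 kHz to 6 kHz span of claim 4 and a 3 kHz boundary (one admissible value within the 2.5 kHz to 4 kHz range of claim 6) are used here as example settings.

```python
import numpy as np

def eigenvalue_profiles(eigvals, bin_freqs, lo_hz=1000.0, split_hz=3000.0, hi_hz=6000.0):
    """Eigenvalue profile calculation means: average the eigenvalues of the bands
    in each frequency range, separately for each eigenvalue number.
    eigvals: (n_bins, n_channels) eigenvalues per band, sorted in descending order.
    bin_freqs: (n_bins,) centre frequency of each FFT bin in Hz."""
    first = (bin_freqs >= lo_hz) & (bin_freqs < split_hz)    # first frequency range
    second = (bin_freqs >= split_hz) & (bin_freqs < hi_hz)   # second frequency range
    profile1 = eigvals[first].mean(axis=0)   # one averaged value per eigenvalue number
    profile2 = eigvals[second].mean(axis=0)
    return np.concatenate([profile1, profile2])              # classifier parameter
```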
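For the sound source number estimating means of claims 7 to 9, a k-nearest neighbor classifier with k = 6 is learned from stored pairs of eigenvalue-profile sets and known source counts. The sketch below uses scikit-learn's KNeighborsClassifier as one possible realization; the training files are placeholders standing in for the learning data storage means, not data from this publication.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Placeholder learning data: each row is a concatenated pair of eigenvalue profiles,
# each label the known number of simultaneously active sources in that interval.
profiles_train = np.load("profiles_train.npy")   # hypothetical file, shape (n_examples, n_features)
counts_train = np.load("counts_train.npy")       # hypothetical file, shape (n_examples,)

knn = KNeighborsClassifier(n_neighbors=6)        # k = 6, as in claim 9
knn.fit(profiles_train, counts_train)

def estimate_num_sources(profile):
    """Sound source number estimating means: classify one eigenvalue-profile set."""
    return int(knn.predict(profile.reshape(1, -1))[0])
```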
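Finally, for the sound source estimation means of claim 1: given the estimated number of sources N, steering vectors for candidate directions are projected onto the noise subspace spanned by the eigenvectors beyond the N largest eigenvalues, and the directions with the largest MUSIC responses are taken. The sketch below is a narrow-band, free-field illustration for a single frequency bin; the steering model, the grid of candidate directions, and the simple top-N selection (a practical implementation would pick local maxima of the spectrum) are assumptions, not the specific formulation of this publication.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steering_vector(mic_xyz, direction, freq_hz):
    """Far-field, free-field steering vector for one unit direction vector.
    mic_xyz: (n_channels, 3) microphone element positions in metres."""
    delays = mic_xyz @ direction / SPEED_OF_SOUND
    return np.exp(-2j * np.pi * freq_hz * delays)

def music_spectrum(eigvecs, n_sources, mic_xyz, directions, freq_hz):
    """MUSIC pseudo-spectrum P(d) = (a^H a) / (a^H E_n E_n^H a) for one frequency bin.
    eigvecs: (n_channels, n_channels) eigenvectors, columns sorted by descending eigenvalue.
    directions: (n_dirs, 3) unit vectors of candidate directions."""
    noise_subspace = eigvecs[:, n_sources:]      # eigenvectors beyond the N largest eigenvalues
    p = np.empty(len(directions))
    for i, d in enumerate(directions):
        a = steering_vector(mic_xyz, d, freq_hz)
        proj = noise_subspace.conj().T @ a       # projection onto the noise subspace
        p[i] = np.abs(a.conj() @ a) / np.abs(proj.conj() @ proj)
    return p

def localize(eigvecs, n_sources, mic_xyz, directions, freq_hz):
    """Sound source estimation means: pick the n_sources strongest MUSIC responses."""
    p = music_spectrum(eigvecs, n_sources, mic_xyz, directions, freq_hz)
    best = np.argsort(p)[::-1][:n_sources]
    return directions[best]
```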
JP2008293831A 2008-11-17 2008-11-17 Sound-source localizing device Pending JP2010121975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2008293831A JP2010121975A (en) 2008-11-17 2008-11-17 Sound-source localizing device


Publications (1)

Publication Number Publication Date
JP2010121975A true JP2010121975A (en) 2010-06-03

Family

ID=42323454

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2008293831A Pending JP2010121975A (en) 2008-11-17 2008-11-17 Sound-source localizing device

Country Status (1)

Country Link
JP (1) JP2010121975A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012042465A (en) * 2010-08-17 2012-03-01 Honda Motor Co Ltd Sound source direction estimation device and sound source direction estimation method
JP2012150237A (en) * 2011-01-18 2012-08-09 Sony Corp Sound signal processing apparatus, sound signal processing method, and program
US9361907B2 (en) 2011-01-18 2016-06-07 Sony Corporation Sound signal processing apparatus, sound signal processing method, and program
JP2012211768A (en) * 2011-03-30 2012-11-01 Advanced Telecommunication Research Institute International Sound source positioning apparatus
US9318124B2 (en) 2011-04-18 2016-04-19 Sony Corporation Sound signal processing device, method, and program
JP2014059225A (en) * 2012-09-18 2014-04-03 Toshiba Corp Receiver, noise suppression method, and noise suppression program
WO2014125736A1 (en) * 2013-02-14 2014-08-21 ソニー株式会社 Speech recognition device, speech recognition method and program
US10475440B2 (en) 2013-02-14 2019-11-12 Sony Corporation Voice segment detection for extraction of sound source
US9357298B2 (en) 2013-05-02 2016-05-31 Sony Corporation Sound signal processing apparatus, sound signal processing method, and program
CN103901888A (en) * 2014-03-18 2014-07-02 北京工业大学 Robot autonomous motion control method based on infrared and sonar sensors

Similar Documents

Publication Publication Date Title
JP4815661B2 (en) Signal processing apparatus and signal processing method
DE60303338T2 Orthogonal circular microphone array system and method for detecting the three-dimensional direction of a sound source using this system
JP4247195B2 (en) Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal processing program, and recording medium recording the acoustic signal processing program
JP4986433B2 (en) Apparatus and method for recognizing and tracking objects
Kumatani et al. Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors
JP3522954B2 (en) Microphone array input type speech recognition apparatus and method
US20090018828A1 (en) Automatic Speech Recognition System
Nakadai et al. Real-time sound source localization and separation for robot audition
KR20090057692A (en) Method and apparatus for filtering the sound source signal based on sound source distance
JP3906230B2 (en) Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal processing program, and computer-readable recording medium recording the acoustic signal processing program
Brandstein et al. A practical methodology for speech source localization with microphone arrays
McCowan et al. Microphone array shape calibration in diffuse noise fields
JP4248445B2 (en) Microphone array method and system, and voice recognition method and apparatus using the same
TW201234873A (en) Sound acquisition via the extraction of geometrical information from direction of arrival estimates
JP2004274763A (en) Microphone array structure, beam forming apparatus and method, and method and apparatus for estimating acoustic source direction
Nakadai et al. Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots
US9460732B2 (en) Signal source separation
JP2008079256A (en) Acoustic signal processing apparatus, acoustic signal processing method, and program
US20080247274A1 (en) Sensor array post-filter for tracking spatial distributions of signals and noise
JP2004325284A (en) Method for presuming direction of sound source, system for it, method for separating a plurality of sound sources, and system for it
Brutti et al. Oriented global coherence field for the estimation of the head orientation in smart rooms equipped with distributed microphone arrays
US9264806B2 (en) Apparatus and method for tracking locations of plurality of sound sources
Bub et al. Knowing who to listen to in speech recognition: Visually guided beamforming
TWI647961B Method and apparatus for determining the direction of a sound source in a higher-order Ambisonics (high-fidelity stereophonic) representation of a sound field
Ishi et al. Evaluation of a MUSIC-based real-time sound localization of multiple sound sources in real noisy environments