US20110222707A1 - Sound source localization system and method - Google Patents
Sound source localization system and method
- Publication number
- US20110222707A1 (application US 12/844,004)
- Authority
- US
- United States
- Prior art keywords
- sound source
- sitds
- source localization
- time
- signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/027—Spatial or constructional arrangements of microphones, e.g. in dummy heads
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/005—Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/32—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
- H04R1/40—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
- H04R1/406—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R5/00—Stereophonic arrangements
- H04R5/04—Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Otolaryngology (AREA)
- General Health & Medical Sciences (AREA)
- Stereophonic System (AREA)
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
Description
- This application claims priority from and the benefit of Korean Patent Application No. 10-2010-0022697, filed on Mar. 15, 2010, which is hereby incorporated by reference for all purposes as if fully set forth herein.
- 1. Field of the Invention
- Disclosed herein is a sound source localization system and method.
- 2. Description of the Related Art
- In general, among the auditory techniques for intelligent robots, sound source localization estimates the position at which a sound is generated by analyzing the properties of signals received through a microphone array. Such a technique can effectively localize sounds produced during human-robot interaction, including sounds originating beyond the field of view of a vision camera.
-
FIG. 1 is a diagram showing a related art sound source localization technique using a microphone array.
- In related art sound source localization techniques, a microphone array has the form of a specific structure, as shown in FIG. 1, and a sound source is localized using such an array. A direction angle is mainly detected by measuring the difference in the times at which a voice signal from the sound source reaches each microphone. Hence, exact measurement is possible only when no object interrupts the propagation of the voice signal between the respective microphones; when the two ears of an actual human being are used, however, a problem occurs.
-
FIG. 2 is a diagram illustrating a problem caused when the related art sound source localization technique is applied to a sound source localization technique using two ears.
- Referring to FIG. 2, when the related art technique is used in an actual robot with two ears, the properties of the signal arriving at the two ears are altered by the face and outer ears located between the microphones, and performance may therefore be degraded.
- A method using a head related transfer function (HRTF) has been proposed to solve this problem. In the HRTF method, the influence of the platform is removed by re-measuring the impulse responses for the form of the corresponding platform. However, measuring impulse responses requires obtaining signals for every direction in an anechoic (dead) room, so the measurement must be repeated whenever the form of the platform changes. The HRTF method is therefore of limited use for robot auditory systems with various types of platforms.
- In addition, since related art sound source localization systems react sensitively to changes in the environment, their programs must be modified to suit each environmental change. This makes them difficult to apply to human-robot interaction, where many such variables exist.
- Disclosed herein is a sound source localization system and method in which sparse coding and a self-organizing map (SOM) are used to model the sound source localization pathway of a human being. Because impulse responses need not be re-measured for each platform, the system and method can be applied to various types of platforms, and because they can adapt to changes in the environment, they can be used in various robot development fields.
- In one embodiment, there is provided a sound source localization system including: a plurality of microphones for receiving a signal as an input from a sound source; a time-difference extraction unit for decomposing the signal inputted through the plurality of microphones into time, frequency and amplitude using sparse coding and then extracting, for each frequency, a sparse interaural time difference (SITD) between the signals inputted through the plurality of microphones; and a sound source localization unit for localizing the sound source using the SITDs.
- In one embodiment, there is provided a sound source localization method including: receiving a signal as an input from a sound source; decomposing the signal into time, frequency and amplitude using sparse coding; extracting an SITD for each frequency; and localizing the sound source using the SITDs.
- The above and other aspects, features and advantages disclosed herein will become apparent from the following description of preferred embodiments given in conjunction with the accompanying drawings, in which:
-
FIG. 1 is a diagram showing a related art sound source localization technique using a microphone array;
FIG. 2 is a diagram illustrating a problem caused when the related art sound source localization technique is applied to a sound source localization technique using two ears;
FIG. 3 is a diagram illustrating a correspondence relation between a human's sound source localization system and a sound source localization system according to an embodiment;
FIG. 4 is a block diagram schematically showing the sound source localization system according to the embodiment;
FIGS. 5A to 5D are graphs showing results obtained by applying filters of the sound source localization system according to the embodiment; and
FIG. 6 is a flowchart schematically illustrating a sound source localization method according to an embodiment.
- Exemplary embodiments now will be described more fully hereinafter with reference to the accompanying drawings, in which exemplary embodiments are shown. This disclosure may, however, be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth therein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of this disclosure to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of this disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, the use of the terms a, an, etc. does not denote a limitation of quantity, but rather denotes the presence of at least one of the referenced item. The use of the terms “first”, “second”, and the like does not imply any particular order, but they are included to identify individual elements. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another. It will be further understood that the terms “comprises” and/or “comprising”, or “includes” and/or “including” when used in this specification, specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, regions, integers, steps, operations, elements, components, and/or groups thereof.
- Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
- In the drawings, like reference numerals in the drawings denote like elements. The shape, size and regions, and the like, of the drawing may be exaggerated for clarity.
-
FIG. 3 is a diagram illustrating a correspondence relation between a human's sound source localization system and a sound source localization system according to an embodiment.
- Referring to FIG. 3, a generated sound source signal is inputted through two (two-channel) microphones attached to an artificial ear (KEMAR ear) 301 corresponding to a human's ear 301′. The inputted sound source signal is then digitalized for sound source localization. Since the inputted signal is processed based on a cognition model of the human auditory sense, the stages of the sound source localization system correspond to the organs involved in human hearing. The localization of the inputted sound source is divided into two parts, i.e., a neural coding 302 and a neural network 303. The neural coding 302 serves as the medial superior olive (MSO) 302′, which extracts the sparse interaural time difference (SITD) used for the sound source localization. The neural network 303 serves as the inferior colliculus (IC) 303′, which localizes a sound source and plays the role of learning. Like the sound source localization performed in the auditory cortex 304′, a sound source localization 304 is performed in the sound source localization system according to the embodiment by passing through the neural coding 302 and the neural network 303.
- It has been described in this embodiment that two microphones are used. However, this is provided only for illustrative purposes, and the system is not limited thereto; the sound source localization system according to the embodiment may be provided with three or more microphones as occasion demands. For example, a plurality of microphones may be divided into two groups respectively disposed at the left and right of a model with the contour of a human face, or the like.
-
FIG. 4 is a block diagram schematically showing the sound source localization system according to the embodiment.
- As previously described with reference to FIG. 3, the sound source localization system according to the embodiment is generally divided into a neural coding and a neural network. Referring to FIG. 4, since the neural coding extracts the SITD, it may correspond to a time-difference extraction unit 410; since the neural network localizes a sound source using the SITD, it may correspond to a sound source localization unit 420.
- The algorithm of the time-difference extraction unit 410 may be performed as follows. A sound source signal 400 is first inputted through two (two-channel) microphones and digitalized for signal processing, at a desired sampling rate, e.g., 16 kHz. The digitalized sound source signal 411 may be inputted in frames (200 ms) to a gammatone filterbank 412 having 64 different center frequencies, filtered for each of the frequencies, and then inputted to a sparse coding 413. An SITD may be evaluated by passing through the sparse coding 413, and errors may be removed from the evaluated SITD by passing through three types of filters 414, which are described later.
- The algorithm of the time-difference extraction unit 410 will now be described in detail. As described above, the sound source signal 400 is inputted through the two (two-channel) microphones and then digitalized. The digitalized sound source signal is divided into frames (200 ms) and transferred to the gammatone filterbank 412. When sound source localization is performed with two artificial ears disposed like human ears, the SITD is changed by the influence of the facial surface. To account for this, the SITD must be evaluated for each frequency, and hence the gammatone filterbank 412, which filters the sound source signal by frequency, is used in the sound source localization system according to the embodiment. The gammatone filterbank 412 is a filter structure obtained by modeling sound processing in the human ear; in particular, it includes a set of bandpass filters that serve as the cochlea does, and the impulse response of the filterbank is evaluated using a gammatone function, as shown in the following equation 1.
h(t)=r(n,b)t n-1 e −bt cos(ωt+φ)u(t) (1) - Here, r(n,b) denotes a normalization factor, b denotes a bandwidth, and w denotes a center frequency.
- As can be seen in
Equation 1, the number of filters and the center frequency and bandwidth of the filterbank are required to produce the gammatone filterbank. Generally, the number of filters is determined by the maximum frequency (fH) and the minimum frequency (fL). The number of filters is evaluated by thefollowing equation 2. In this embodiment, the maximum and minimum frequencies are set as 100 Hz and 8 KHz, respectively, and the number of filters is then evaluated. -
- n = 9.26 · v · ln((fH + 228.7) / (fL + 228.7))   (2)
following equation 3. -
- fc(i) = −228.7 + (fH + 228.7) · e^(−i/(9.26·v)),  i = 1, …, n   (3)
- In this embodiment, the technique of a
sparse coding 412 is used in which the inputted sound source signal is decomposed into three factors of time, frequency and amplitude. In the technique of thesparse coding 412, a general signal is decomposed into three factors of time, frequency and amplitude by the following equation 4, using a sparse and kernel method. -
- x(t) = Σ_m Σ_{i=1..n_m} S_i^m · φ_m(t − T_i^m) + ε(t)   (4)
- Here, various algorithms may be used to decompose the inputted signal into the generated kernel function. A matching pursuit algorithm has been used in this embodiment. The time difference between two channels (signals of left and right ears, i.e., signals of left and right microphones) is extracted for each frequency by decomposing the signal into a kernel function for each channel and a combination of coefficients using the matching pursuit algorithm and then detecting the maximum coefficient for each of the channels. The extracted time difference is referred to as an SITD named after a sparse ITD. The extracted SITD is transferred to the neural network, i.e., the sound
source localization unit 420, so that the sound source is localized. - When the SITD is calculated in the sparse coding, the signal inputted with 16 KHz is divided by 200 msec to use 3200 data. Then, 25% of the data is overlapped in the calculation of the next frame. In one frame, there exist SITDs of 64 channels. However, when all the channels are used, this may have a bad influence on the sound source localization due to problems of an environmental noise, a small coefficient and the like. In order to remove such an influence, the aforementioned three types of
filters 414 are used in this embodiment. - A first filter is referred to as a mean-variance filter. The first filter is a filter that evaluates the Gaussian mean of the SITDs and removes SITDs of which errors are greater than a predetermined value based on the evaluated mean. The predetermined value is a value predetermined by a user within an error range that is not considered as a normal signal. A second filter is a bandpass filter in which only the SITD result of the gammatone filterbank in a corresponding region is used in a voice band. The sound band refers to a band of 500 to 4000 Hz. A third filter is a filter that removes errors when the coefficient of the extracted SITD is smaller than a specific threshold determined by a user.
- Although the aforementioned filters are referred to as first, second and third filters, respectively, the order of the filters is not particularly limited. Each of the filters is not essential, and some or all of the filters may be deleted or added as occasion demands. The filters are provided only for illustrative purposes, and other types of filters may be used.
-
FIGS. 5A to 5D are graphs showing results obtained by applying filters of the sound source localization system according to the embodiment. -
FIG. 5A is a graph of an SITD that does not passes through filtering. That is, the SITD that passes through the gammatone filterbank, the sparse coding and the like is represented by a spike-gram as shown inFIG. 5A . InFIG. 5A , it can be seen that calculated values are not equal and values with large errors exist. -
FIG. 5B shows a result obtained by passing through the first filter, andFIG. 5C shows a result obtained by sequentially passing through the first and second filters.FIG. 5D shows a result obtained by sequentially passing through the first, second and third filters. As described above, the order of the filtering processes is not particularly limited, and the same result is derived even though the order of the filtering processes is changed. Any one of the filtering processes may be deleted or added as occasion demands, and the result becomes more accurate as the number of filtering processes is increased. As can be seen inFIGS. 5B to 5D , the SITD results are equalized as the filtering processes are performed one by one. - Referring back to
FIG. 4 , the SITD that passes through the aforementioned filtering processes is inputted to the neural network, i.e., the soundsource localization unit 420 as an input. - The sound
source localization unit 420 in the sound source system according to the embodiment may use a self-organizing map (SOM) 421 that is one of neural networks. As described in the background section, in the related art sound source localization system, ITDs are calculated using the head related transfer function (HRTF) at each frequency bandwidth. However, in order to precisely implement the HRTF, impulse responses are necessarily measured by changing an angle and generating a sound source in a dead room. Hence, many costs and resources are consumed in constructing the system. - Contrastively, in the SOM of the sound
source localization unit 420 in the sound source localization system according to the embodiment, a learning process is performed using the system constructed in the initialized SOM and the SITD estimated through the neural coding in an actual environment, and the result is then estimated from the SOM. Unlike the general neural network, the on-line learning of the SOM is possible. Therefore, the SOM can be adapted to a change in ambient environment, hardware or the like, as the same principle that a human being is adapted to a change in the function of an auditory sense. - The localization of the
sound source 430 can be performed by passing the inputted sound source signal through the time-difference extraction unit 410 and the soundsource localization unit 420. -
FIG. 6 is a flowchart schematically illustrating a sound source localization method according to an embodiment. - In the sound source localization method according to the embodiment, a signal is received as an input from a sound source (S601). Subsequently, the inputted signal is decomposed into time, frequency and amplitude using a sparse coding (S602). Then, an SITD is extracted for each frequency using the separated signal (S603).
- The SITDs are filtered by several filters (S604). For example, the SITDs may be filtered by first, second and third filters. Here, the first filter is a filter that evaluates the Gaussian mean of the SITDs and removes SITDs of which errors are greater than a predetermined value based on the evaluated mean. The second filter is a filter that passes only SITDs within a voice band among the SITDs. The third filter is a filter that passes only SITDs of which coefficients are smaller than a predetermined threshold. Although these filters are referred to as first, second and third filters, respectively, the order of the filters is not particularly limited. Each of the filters is not essential, and some or all of the filters may be deleted or added as occasion demands. The filters are provided only for illustrative purposes, and other types of filters may be used.
- The sound source is localized using the SITDs that pass through the aforementioned filtering processes (S605). The operation S605 can be performed by learning the SITDs and localizing the sound source using the learned SITDs.
- The sound source localization method described above has been described with reference to the flowchart shown in
FIG. 6 . For brief description, the method is illustrated and described using a series of blocks. However, the order of the blocks is not particularly limited, and some blocks may be performed simultaneously or in a different order from the order illustrated and described in this specification. Also, various orders of other branches, flow paths and blocks may be implemented to achieve the identical or similar result. All the blocks shown inFIG. 6 may not be required to implement the method described in this specification. - In the sound source localization system and method, disclosed herein, a sparse coding and a self-organized map (SOM) are used to implement sound source localization using a sound source localization path of a human being as a model, so that the system and method can be applied to various types of platforms because impulse responses are unnecessary to be measured every time, and can be used in various robot development fields because it is possible to be adapted to a change in environment.
- While the disclosure has been described in connection with certain exemplary embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, but, on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims, and equivalents thereof.
Claims (12)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2010-0022697 | 2010-03-15 | ||
KR1020100022697A KR101090893B1 (en) | 2010-03-15 | 2010-03-15 | Sound source localization system |
Publications (2)
Publication Number | Publication Date |
---|---|
US20110222707A1 true US20110222707A1 (en) | 2011-09-15 |
US8270632B2 US8270632B2 (en) | 2012-09-18 |
Family
ID=44559985
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/844,004 Active 2031-03-24 US8270632B2 (en) | 2010-03-15 | 2010-07-27 | Sound source localization system and method |
Country Status (2)
Country | Link |
---|---|
US (1) | US8270632B2 (en) |
KR (1) | KR101090893B1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120109375A1 (en) * | 2009-06-26 | 2012-05-03 | Lizard Technology | Sound localizing robot |
US20130096922A1 (en) * | 2011-10-17 | 2013-04-18 | Fondation de I'Institut de Recherche Idiap | Method, apparatus and computer program product for determining the location of a plurality of speech sources |
CN103985390A (en) * | 2014-05-20 | 2014-08-13 | 北京安慧音通科技有限责任公司 | Method for extracting phonetic feature parameters based on gammatone relevant images |
US20150289075A1 (en) * | 2013-05-17 | 2015-10-08 | Canon Kabushiki Kaisha | Method for determining a direction of at least one sound source from an array of microphones |
US10063965B2 (en) * | 2016-06-01 | 2018-08-28 | Google Llc | Sound source estimation using neural networks |
CN111462766A (en) * | 2020-04-09 | 2020-07-28 | 浙江大学 | Auditory pulse coding method and system based on sparse coding |
CN112904279A (en) * | 2021-01-18 | 2021-06-04 | 南京工程学院 | Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9881616B2 (en) * | 2012-06-06 | 2018-01-30 | Qualcomm Incorporated | Method and systems having improved speech recognition |
US9395723B2 (en) | 2013-09-30 | 2016-07-19 | Five Elements Robotics, Inc. | Self-propelled robot assistant |
US9883142B1 (en) | 2017-03-21 | 2018-01-30 | Cisco Technology, Inc. | Automated collaboration system |
US11190896B1 (en) | 2018-09-27 | 2021-11-30 | Apple Inc. | System and method of determining head-related transfer function parameter based on in-situ binaural recordings |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6719700B1 (en) * | 2002-12-13 | 2004-04-13 | Scimed Life Systems, Inc. | Ultrasound ranging for localization of imaging transducer |
US7495998B1 (en) * | 2005-04-29 | 2009-02-24 | Trustees Of Boston University | Biomimetic acoustic detection and localization system |
US7586513B2 (en) * | 2003-05-08 | 2009-09-08 | Tandberg Telecom As | Arrangement and method for audio source tracking |
US20100217590A1 (en) * | 2009-02-24 | 2010-08-26 | Broadcom Corporation | Speaker localization system and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100943224B1 (en) | 2007-10-16 | 2010-02-18 | 한국전자통신연구원 | An intelligent robot for localizing sound source by frequency-domain characteristics and method thereof |
-
2010
- 2010-03-15 KR KR1020100022697A patent/KR101090893B1/en active IP Right Grant
- 2010-07-27 US US12/844,004 patent/US8270632B2/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6719700B1 (en) * | 2002-12-13 | 2004-04-13 | Scimed Life Systems, Inc. | Ultrasound ranging for localization of imaging transducer |
US7586513B2 (en) * | 2003-05-08 | 2009-09-08 | Tandberg Telecom As | Arrangement and method for audio source tracking |
US7495998B1 (en) * | 2005-04-29 | 2009-02-24 | Trustees Of Boston University | Biomimetic acoustic detection and localization system |
US20100217590A1 (en) * | 2009-02-24 | 2010-08-26 | Broadcom Corporation | Speaker localization system and method |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120109375A1 (en) * | 2009-06-26 | 2012-05-03 | Lizard Technology | Sound localizing robot |
US20130096922A1 (en) * | 2011-10-17 | 2013-04-18 | Fondation de I'Institut de Recherche Idiap | Method, apparatus and computer program product for determining the location of a plurality of speech sources |
US9689959B2 (en) * | 2011-10-17 | 2017-06-27 | Foundation de l'Institut de Recherche Idiap | Method, apparatus and computer program product for determining the location of a plurality of speech sources |
US20150289075A1 (en) * | 2013-05-17 | 2015-10-08 | Canon Kabushiki Kaisha | Method for determining a direction of at least one sound source from an array of microphones |
US9338571B2 (en) * | 2013-05-17 | 2016-05-10 | Canon Kabushiki Kaisha | Method for determining a direction of at least one sound source from an array of microphones |
CN103985390A (en) * | 2014-05-20 | 2014-08-13 | 北京安慧音通科技有限责任公司 | Method for extracting phonetic feature parameters based on gammatone relevant images |
US10063965B2 (en) * | 2016-06-01 | 2018-08-28 | Google Llc | Sound source estimation using neural networks |
CN111462766A (en) * | 2020-04-09 | 2020-07-28 | 浙江大学 | Auditory pulse coding method and system based on sparse coding |
CN112904279A (en) * | 2021-01-18 | 2021-06-04 | 南京工程学院 | Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum |
Also Published As
Publication number | Publication date |
---|---|
KR20110103572A (en) | 2011-09-21 |
US8270632B2 (en) | 2012-09-18 |
KR101090893B1 (en) | 2011-12-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8270632B2 (en) | Sound source localization system and method | |
Deleforge et al. | Acoustic space learning for sound-source separation and localization on binaural manifolds | |
CN110515456B (en) | Electroencephalogram signal emotion distinguishing method and device based on attention mechanism | |
Nakadai et al. | Applying scattering theory to robot audition system: Robust sound source localization and extraction | |
Buchner et al. | TRINICON: A versatile framework for multichannel blind signal processing | |
EP1818909B1 (en) | Voice recognition system | |
Kounades-Bastian et al. | A variational EM algorithm for the separation of time-varying convolutive audio mixtures | |
US6408269B1 (en) | Frame-based subband Kalman filtering method and apparatus for speech enhancement | |
Deleforge et al. | The cocktail party robot: Sound source separation and localisation with an active binaural head | |
Desai et al. | A review on sound source localization systems | |
Nakadai et al. | Epipolar geometry based sound localization and extraction for humanoid audition | |
CN111046840B (en) | Personnel safety monitoring method and system based on artificial intelligence in pollution remediation environment | |
Voutsas et al. | A biologically inspired spiking neural network for sound source lateralization | |
Taghizadeh et al. | Ad hoc microphone array calibration: Euclidean distance matrix completion algorithm and theoretical guarantees | |
Anumula et al. | An event-driven probabilistic model of sound source localization using cochlea spikes | |
Weiss et al. | Combining localization cues and source model constraints for binaural source separation | |
Liu et al. | Head‐related transfer function–reserved time‐frequency masking for robust binaural sound source localization | |
Nix et al. | Combined estimation of spectral envelopes and sound source direction of concurrent voices by multidimensional statistical filtering | |
Geirnaert et al. | EEG-based auditory attention decoding: Towards neuro-steered hearing devices | |
Zolfaghari et al. | Large deformation diffeomorphic metric mapping and fast-multipole boundary element method provide new insights for binaural acoustics | |
CN112731291B (en) | Binaural sound source localization method and system for collaborative two-channel time-frequency mask estimation task learning | |
Deleforge et al. | Towards a generalization of relative transfer functions to more than one source | |
Gajecki et al. | Deep latent fusion layers for binaural speech enhancement | |
Đurković | Localization, tracking, and separation of sound sources for cognitive robots | |
Sutojo et al. | A distance measure to combine monaural and binaural auditory cues for sound source segregation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: KOREA INSTITUTE OF SCIENCE AND TECHNOLOGY, KOREA, Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HWANG, DO HYUNG;CHOI, JONGSUK;REEL/FRAME:024745/0202 Effective date: 20100720 |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
FEPP | Fee payment procedure |
Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
FPAY | Fee payment |
Year of fee payment: 4 |
|
SULP | Surcharge for late payment | ||
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2552); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 8 |
|
MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 12TH YR, SMALL ENTITY (ORIGINAL EVENT CODE: M2553); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY Year of fee payment: 12 |