CN113419216B

CN113419216B - Multi-sound source positioning method suitable for reverberant environment

Info

Publication number: CN113419216B
Application number: CN202110684270.9A
Authority: CN
Inventors: 胡秋岑; 吴礼福
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2021-06-21
Filing date: 2021-06-21
Publication date: 2023-10-31
Anticipated expiration: 2041-06-21
Also published as: CN113419216A

Abstract

The invention provides a multi-sound-source localization method suitable for reverberant environments. The method comprises: grouping all coordinates of the whole search area and calculating the center coordinates of each group; collecting voice signals with a 16-microphone array; locating one sound source with a multi-sound-source localization algorithm based on double-layer search space clustering (TL-SSC) and removing the coordinates near that sound source from the search area; and repeating this operation until all sound sources are located. The method addresses real-time localization of multiple sound sources under reverberation. Compared with other multi-sound-source localization methods, it requires fewer microphones, improves computational efficiency while maintaining high localization accuracy, and meets the real-time requirements of mobile robot applications.

Description

Multi-sound source positioning method suitable for reverberant environment

Technical Field

The invention relates to the technical field of sound source localization, in particular to a multi-sound source localization method suitable for a reverberation environment.

Background

Multi-sound-source localization is in wide demand in real-time systems such as video conferencing, voice recognition and mobile service robots, and is a research hotspot in the field of acoustic signal processing. For example, when a mobile robot provides intelligent services in real time, a multi-sound-source localization method determines the position of the speech and guides the robot to complete the service. Existing sound source localization methods fall mainly into three types: subspace-based methods, methods based on controllable beamforming, and methods based on time delay of arrival. Subspace-based methods receive the signals through each microphone, exploit the orthogonality of the signal subspace and the noise subspace to construct a spatial spectrum function, and search for its spectral peak to obtain the sound source direction; they offer high localization accuracy but place high demands on the stability of the sound source signal and perform poorly in small spaces. Methods based on controllable beamforming change the direction in which the microphone array receives signals and select the direction with the maximum received power as the sound source direction; the principle is simple and the computational load is small, but noise robustness is poor, environmental noise information must be acquired in advance, and real-time localization is difficult to guarantee. Methods based on time delay of arrival determine the sound source position from the sound path differences between the source and each microphone; their computational complexity is generally lower than that of the other two types, their localization accuracy is relatively high, and real-time localization is easy to achieve.

Disclosure of Invention

In the prior art, applications with demanding real-time requirements on the algorithm, such as indoor mobile robots, face the problem of improving computational efficiency in a small space while maintaining accuracy as far as possible. To address this problem, the invention provides a multi-sound-source localization method suitable for reverberant environments. It is a multi-sound-source localization method based on double-layer search space clustering (Two-Levels Search Space Clustering, TL-SSC) that uses 16 microphones. It improves the computational efficiency of the system through coordinate grouping, real-time double-layer search, cluster screening and threshold judgment, and achieves real-time localization of multiple sound sources by estimating the time difference of arrival (Time Difference of Arrival, TDOA).

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A multi-sound-source localization method suitable for reverberant environments comprises the following steps:

S1, collecting the coordinates in the whole search area, grouping them, and calculating the center coordinates of each group;

S2, collecting voice signals by using a microphone array;

S3, using the multi-sound-source localization algorithm of double-layer search space clustering, determining the candidate subgroups of one sound source position by calculating the center-coordinate power of each subgroup, locating the sound source position among all coordinates contained in the candidate subgroups, and removing the coordinates near that sound source from the search area;

repeating step S3 until all sound source positions are located.
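The control flow of step S3 and its repetition can be illustrated with a deliberately simplified sketch. It assumes a one-dimensional azimuth power map instead of the grouped three-dimensional search area, a known number of sources, and strictly positive powers; the names `locate_sources`, `exclusion_deg` and `low_power` are hypothetical and do not come from the patent.

```python
from typing import Dict, List

def locate_sources(power: Dict[float, float], n_sources: int,
                   exclusion_deg: float, low_power: float = 0.0) -> List[float]:
    """Greedy 'locate, suppress, repeat' loop over an azimuth power map.

    power maps an azimuth in degrees to a steered power value (assumed > 0).
    """
    remaining = dict(power)  # work on a copy so the caller's map is untouched
    found = []
    for _ in range(n_sources):
        # Locate: take the azimuth with the largest remaining power.
        best = max(remaining, key=remaining.get)
        found.append(best)
        # Suppress: give every azimuth near the located source a low power,
        # so that it cannot be picked again for the next source.
        for az in remaining:
            diff = abs((az - best + 180.0) % 360.0 - 180.0)  # wrapped difference
            if diff <= exclusion_deg:
                remaining[az] = low_power
    return found

# Toy map: a strong source near 0 degrees and a weaker one near 180 degrees.
toy_map = {0: 1.0, 10: 0.9, 20: 0.3, 90: 0.2, 180: 0.8, 190: 0.6}
print(locate_sources(toy_map, n_sources=2, exclusion_deg=30.0))
```

Without the suppression step the second pick would be the 10 degree bin, which belongs to the first source; this is the effect the removal of nearby coordinates is meant to avoid.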

In order to optimize the technical scheme, the specific measures adopted further comprise:

Further, the grouping in step S1 is based on:

if the i-th coordinate q_i belongs to the j-th group z_j, then p(q_i ∈ z_j) has the value 1; if the i-th coordinate q_i does not belong to the j-th group z_j, p(q_i ∈ z_j) has the value 0;

where I denotes the total number of coordinates in the whole search area, J denotes the current number of groups, and z_j denotes the set of all coordinates in the j-th group; J has an initial value of 1 and is increased by 1 at a time until the termination formula holds; as i and j change, p(q_i ∈ z_j), e(q_i, z_j) and the center coordinates of z_j are calculated by the K-means algorithm;

where e(q_i, z_j) denotes the group error, defined as the sum of the sound path differences over all microphone pairs, evaluated from the TDOA between microphone k and microphone l for position q_i and the TDOA between microphone k and microphone l for the center coordinates of group z_j; M denotes the number of microphones, and θ_t denotes the threshold.
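The grouping procedure can be sketched as follows. This is an illustration under stated assumptions, not the patent's exact procedure: the group error is taken as the summed absolute TDOA difference over all microphone pairs, the termination test is assumed to be "the largest group error does not exceed θ_t" (with θ_t in seconds), group centers are the means of their member coordinates, and the speed of sound is assumed to be 343 m/s. All function names are hypothetical.

```python
import math
import random
from itertools import combinations

C = 343.0  # speed of sound in m/s (assumed value)

def tdoa_vector(q, mics):
    """TDOAs in seconds of point q for every microphone pair (k, l), k < l."""
    d = [math.dist(q, m) for m in mics]
    return [(d[k] - d[l]) / C for k, l in combinations(range(len(mics)), 2)]

def group_error(tq, tz):
    """e(q_i, z_j): summed absolute TDOA difference over all microphone pairs."""
    return sum(abs(a - b) for a, b in zip(tq, tz))

def assign(tq, centers, mics):
    """Assign each point to the group with the smallest error; report the worst error."""
    tz = [tdoa_vector(z, mics) for z in centers]
    labels = [min(range(len(tz)), key=lambda j: group_error(t, tz[j])) for t in tq]
    worst = max(group_error(t, tz[lab]) for t, lab in zip(tq, labels))
    return labels, worst

def cluster(points, mics, n_groups, iters=20, seed=0):
    """K-means-style grouping with the TDOA group error as the distance."""
    centers = random.Random(seed).sample(points, n_groups)
    tq = [tdoa_vector(q, mics) for q in points]
    for _ in range(iters):
        labels, _worst = assign(tq, centers, mics)
        for j in range(n_groups):
            members = [q for q, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a group becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels, worst = assign(tq, centers, mics)
    return labels, centers, worst

def group_search_area(points, mics, theta_t):
    """Increase J from 1 until the worst group error is within theta_t (assumed test)."""
    for n_groups in range(1, len(points) + 1):
        labels, centers, worst = cluster(points, mics, n_groups)
        if worst <= theta_t:
            break
    return labels, centers

mics = [(0.1, 0.1, 0.0), (-0.1, 0.1, 0.0), (0.1, -0.1, 0.0), (-0.1, -0.1, 0.0)]
points = [(x * 0.5, 1.0, 0.0) for x in range(-4, 5)]
labels, centers = group_search_area(points, mics, theta_t=1e-4)
print(len(centers), labels)
```

A tighter θ_t yields more, smaller groups, which raises the cost of the first-layer search but narrows the second-layer search inside each candidate group.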

Further, the threshold θ_t is defined as follows:

where λ denotes the wavelength, c is the speed of sound, and f is the sampling rate; in sound source localization, the value of θ_t is determined by the maximum frequency of the speech signal.

Further, the microphone array collects voice signals with 16 microphones; the array as a whole is a cylinder, with 8 microphones uniformly distributed on each of the upper and lower contours.
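A minimal sketch of such an array geometry is shown below. The radius and height are placeholder values chosen for illustration; the patent text here does not state the array dimensions, and the function name `cylindrical_array` is hypothetical.

```python
import math

def cylindrical_array(radius=0.1, height=0.2, per_ring=8):
    """Return 2 * per_ring microphone positions (x, y, z) in metres.

    The microphones sit on two rings, one on the upper and one on the
    lower rim of a cylinder centred at the origin (assumed placement).
    """
    mics = []
    for z in (height / 2.0, -height / 2.0):       # upper ring, then lower ring
        for n in range(per_ring):
            angle = 2.0 * math.pi * n / per_ring  # uniform angular spacing
            mics.append((radius * math.cos(angle), radius * math.sin(angle), z))
    return mics

array = cylindrical_array()
print(len(array))  # 16 positions: 8 per ring
```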

Further, in the time domain, the coordinate power is calculated as follows:

where y(t, q) denotes the output value at coordinate position q at time t, g_m(t) denotes the impulse response of the filter at the m-th microphone, x_m(t + τ_{m,q}) denotes the signal received by the m-th microphone at time t + τ_{m,q}, and τ_{m,q} denotes the signal propagation time from coordinate position q to the m-th microphone; in the frequency domain, the formula for calculating the coordinate power is expressed as:

where Y(ω, q) denotes the output value at coordinate position q at frequency ω, X_m(ω) denotes the Fourier transform of the m-th microphone signal, and G_m(ω) denotes the frequency-domain system function of the filter at the m-th microphone;

based on the frequency-domain formula for the coordinate power, the power output value P(q) at coordinate position q is obtained as:

where G_l(ω) denotes the frequency-domain system function of the filter at the l-th microphone, X_l(ω) denotes the Fourier transform of the l-th microphone signal, G_k*(ω) denotes the conjugate of the frequency-domain system function of the filter at the k-th microphone, X_k*(ω) denotes the conjugate of the Fourier transform of the k-th microphone signal, and τ_{k,q} denotes the signal propagation time from coordinate position q to the k-th microphone;

where the weighting term is the PHAT weighting coefficient between the l-th microphone signal and the k-th microphone signal;

after the center-coordinate power of each group is calculated, the candidate groups are determined and the sound source is located among all coordinates contained in the candidate groups; the sound source position is the coordinate corresponding to the maximum power value, namely:
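For reference, the quantities defined in this passage match the standard steered-response-power formulation with PHAT weighting. The expressions below are a reconstruction under that assumption, written only with the symbols defined above; the names Ψ_lk for the PHAT weighting coefficient and q̂_s for the located position are assumed here and are not taken from the patent.

```latex
\begin{align}
y(t, q) &= \sum_{m=1}^{M} g_m(t) \ast x_m(t + \tau_{m,q}) \\
Y(\omega, q) &= \sum_{m=1}^{M} G_m(\omega)\, X_m(\omega)\, e^{j\omega\tau_{m,q}} \\
P(q) &= \int_{-\infty}^{\infty} \lvert Y(\omega, q) \rvert^{2}\, d\omega
      = \sum_{l=1}^{M} \sum_{k=1}^{M} \int_{-\infty}^{\infty}
        G_l(\omega) X_l(\omega)\, G_k^{*}(\omega) X_k^{*}(\omega)\,
        e^{j\omega(\tau_{l,q} - \tau_{k,q})}\, d\omega \\
\Psi_{lk}(\omega) &= G_l(\omega)\, G_k^{*}(\omega)
      = \frac{1}{\lvert X_l(\omega)\, X_k^{*}(\omega) \rvert} \\
\hat{q}_s &= \arg\max_{q} P(q)
\end{align}
```

The second line follows from the first because a time advance of τ_{m,q} corresponds to multiplication by e^{jωτ_{m,q}} in the frequency domain, and the double sum in the third line is the expansion of |Y(ω, q)|².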

Further, the candidate subgroups of a given sound source are determined as follows: a search is performed on the calculated center-coordinate power values of the subgroups, and the subgroup with the maximum power value is selected as the first candidate subgroup; when the v-th of the remaining subgroups is judged, the conditions for selecting it as a candidate subgroup are:

|θ_v - θ_c| ≤ θ_1,

the judgment of candidate subgroups terminates once the number u of candidate subgroups reaches a given number or all subgroups have been judged;

where (X_b, Y_b, Z_b) denotes the center coordinates of the b-th subgroup among the existing candidate subgroups in the Cartesian coordinate system; (X_cc, Y_cc, Z_cc) denotes the average coordinates obtained by averaging the center coordinates of all current candidate subgroups in the Cartesian coordinate system; θ_c and the corresponding elevation angle are the azimuth and elevation of those average coordinates; θ_v and the corresponding elevation angle are the azimuth and elevation of the center coordinates of the subgroup currently being judged; and θ_1 and its elevation counterpart are the azimuth threshold and the elevation threshold, respectively.
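The screening rule can be sketched as below. Several details are assumptions made for illustration: the remaining subgroups are judged in descending order of center power, azimuth is measured with `atan2(y, x)` and elevation from the horizontal plane, angles are in degrees, and an elevation condition of the same form as the azimuth condition is applied. The function names are hypothetical.

```python
import math

def to_angles(p):
    """Azimuth and elevation in degrees of a Cartesian point (assumed convention)."""
    x, y, z = p
    return (math.degrees(math.atan2(y, x)),
            math.degrees(math.atan2(z, math.hypot(x, y))))

def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def select_candidates(centers, powers, az_thr, el_thr, max_candidates):
    """Return the indices of the subgroups kept as candidates for one source."""
    order = sorted(range(len(centers)), key=lambda i: powers[i], reverse=True)
    chosen = [order[0]]  # first candidate: subgroup with the maximum power
    for v in order[1:]:
        if len(chosen) >= max_candidates:
            break
        # Average the Cartesian centers of the current candidates.
        mean = tuple(sum(centers[b][d] for b in chosen) / len(chosen)
                     for d in range(3))
        az_c, el_c = to_angles(mean)
        az_v, el_v = to_angles(centers[v])
        # Keep subgroup v only if it lies in roughly the same direction.
        if angle_diff(az_v, az_c) <= az_thr and abs(el_v - el_c) <= el_thr:
            chosen.append(v)
    return chosen

centers = [(1.0, 0.0, 0.0),
           (math.cos(math.radians(10)), math.sin(math.radians(10)), 0.0),
           (-1.0, 0.0, 0.0),
           (1.0, 0.0, 0.1)]
powers = [1.0, 0.9, 0.8, 0.7]
print(select_candidates(centers, powers, az_thr=15.0, el_thr=10.0, max_candidates=3))
```

In this toy case the third subgroup has high power but points in the opposite direction, so it is rejected as belonging to a different source.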

Further, removing the coordinates near the sound source in the search area specifically comprises:

defining a region Ω, uniformly reducing the power at the coordinate positions of the subgroups contained in region Ω by assigning them the power value E_l, and excluding the subgroups contained in region Ω from the subsequent steps of locating other sound source positions;

in the spherical coordinate system, the coordinates within region Ω satisfy:

|θ - θ_s| ≤ θ_2,

where θ denotes the azimuth of the coordinates, the second angle denotes their elevation, and r denotes the distance of the coordinates from the origin of the coordinate system; θ_s and the corresponding elevation angle are the azimuth and elevation of the coordinate position of the sound source located immediately before the one currently being located; and θ_2 and its elevation counterpart are the azimuth threshold and the elevation threshold, respectively.
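The removal step can be sketched as a simple angular test on the subgroup centers. The sketch assumes angles in degrees, the same azimuth and elevation convention for all points, an elevation condition of the same form as the azimuth condition, and no constraint on the distance r; the function names are hypothetical.

```python
import math

def spherical_angles(p):
    """Azimuth and elevation in degrees of a Cartesian point (assumed convention)."""
    x, y, z = p
    return (math.degrees(math.atan2(y, x)),
            math.degrees(math.atan2(z, math.hypot(x, y))))

def suppress_region(centers, powers, source, az_thr, el_thr, low_power):
    """Assign low_power to every subgroup whose center falls inside region Omega.

    Returns the modified power list and the set of excluded subgroup indices.
    """
    az_s, el_s = spherical_angles(source)
    new_powers = list(powers)  # leave the input list unchanged
    excluded = set()
    for j, c in enumerate(centers):
        az, el = spherical_angles(c)
        d_az = abs((az - az_s + 180.0) % 360.0 - 180.0)  # wrapped azimuth gap
        if d_az <= az_thr and abs(el - el_s) <= el_thr:
            new_powers[j] = low_power
            excluded.add(j)
    return new_powers, excluded

centers = [(1.0, 0.0, 0.0), (0.98, 0.17, 0.0), (0.0, 1.0, 0.0)]
powers = [1.0, 0.9, 0.5]
print(suppress_region(centers, powers, source=(2.0, 0.0, 0.0),
                      az_thr=20.0, el_thr=20.0, low_power=0.0))
```

Returning the excluded indices separately makes it possible to skip those subgroups outright in the next search instead of relying only on their reduced power.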

The beneficial effects of the invention are as follows:

1. The multi-sound-source localization method suitable for reverberant environments selects a suitable removal region and suitable candidate-subgroup screening conditions through steps such as cluster screening and threshold judgment, so that the TL-SSC algorithm can be applied to multi-sound-source systems.

2. Compared with other multi-sound-source localization methods, the method requires fewer microphones, improves computational efficiency while maintaining high localization accuracy, and meets the real-time requirements of mobile robot applications.

Drawings

FIG. 1 is a schematic diagram of the microphone and sound source distribution of the present invention; in the figure, pentagrams represent microphones and dots represent sound sources.

Detailed Description

The invention will now be described in further detail with reference to the accompanying drawings.

The localization of two sound sources among the plurality of sound sources is described as an example.

The primary coordinates in the whole search area are grouped and the center coordinates of each group are calculated. The grouping is based on:

where I denotes the total number of primary coordinates in the whole search area, J denotes the current number of groups, M denotes the number of microphones, and θ_t denotes the threshold. z_j denotes the set of all coordinates in the j-th group; if the i-th coordinate q_i belongs to the j-th group z_j, then p(q_i ∈ z_j) is 1, otherwise it is 0. e(q_i, z_j) denotes the group error, defined as the sum of the sound path differences over all microphone pairs, evaluated from the TDOA between microphone k and microphone l for position q_i and the corresponding TDOA for the center coordinates of group z_j. The initial value of J is 1, and J is increased by 1 each time until the termination formula holds.

As J increases, p(q_i ∈ z_j), e(q_i, z_j) and the center coordinates of z_j are calculated by the K-means algorithm. The threshold θ_t is defined as:

The candidate subgroups of one of the sound sources (the sound source that is not the first to be localized) are screened out and the source is localized. The power value of each subgroup is calculated from the grouping result, the first-layer search is performed, and the subgroup whose center coordinates have the maximum power value is selected as the first candidate subgroup. Suppose that u_1 subgroups have already been selected as candidate subgroups in the first-layer search. To select the subgroups that belong to the same sound source as the existing candidate subgroups, when the v_1-th of the remaining subgroups is judged, the conditions for selecting it as a candidate subgroup are:

The judgment continues until u_1 reaches a given number n or all subgroups have been judged. Here the center coordinates of the b_1-th subgroup among the existing candidate subgroups are expressed in the Cartesian coordinate system; (X_c1, Y_c1, Z_c1) denotes the average coordinates obtained by averaging the center coordinates of all current candidate subgroups in the Cartesian coordinate system; θ_c1 and the corresponding elevation angle are the azimuth and elevation of those average coordinates; the azimuth and elevation of the center coordinates of the subgroup currently being judged are defined in the same way; and θ_1 and its elevation counterpart are the azimuth threshold and the elevation threshold, respectively.

Using the power formula, the power P(q) of each primary coordinate in all candidate subgroups is calculated, the second-layer search for the first sound source is performed over all these coordinates, and the position with the maximum output power is taken as the position of the first sound source.
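The second-layer search can be illustrated with a plain time-domain delay-and-sum power, which corresponds to the time-domain output with every filter g_m(t) taken as a unit impulse. This is a simplification and not the PHAT-weighted power of the method; integer-sample delays, a speed of sound of 343 m/s and the function names are likewise assumptions.

```python
import math

C = 343.0  # speed of sound in m/s (assumed value)

def steered_power(signals, mics, q, fs):
    """Delay-and-sum power at point q using integer-sample alignment."""
    dists = [math.dist(q, m) for m in mics]
    # Sample delay of each microphone relative to the closest microphone.
    shifts = [round((d - min(dists)) / C * fs) for d in dists]
    n = len(signals[0]) - max(shifts)
    total = 0.0
    for t in range(n):
        y = sum(sig[t + s] for sig, s in zip(signals, shifts))  # aligned sum
        total += y * y
    return total

def second_layer_search(signals, mics, candidate_points, fs):
    """Return the candidate coordinate with the maximum steered power."""
    return max(candidate_points, key=lambda q: steered_power(signals, mics, q, fs))

# Two microphones 0.343 m apart at fs = 1000 Hz: one sample of delay per spacing.
mics = [(0.0, 0.0, 0.0), (0.343, 0.0, 0.0)]
fs = 1000
x0 = [0, 0, 0, 1, 0, 0, 0, 0]  # pulse reaches microphone 0 first
x1 = [0, 0, 0, 0, 1, 0, 0, 0]  # and microphone 1 one sample later
candidates = [(1.343, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.1715, 5.0, 0.0)]
print(second_layer_search([x0, x1], mics, candidates, fs))
```

Only the candidate on the side of microphone 0 aligns the two pulses, so its summed power is the largest of the three.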

The subgroups near the located sound source are removed by assigning them a low power value, which reduces the influence of the coordinates around that source on the later localization of the second sound source. A region Ω is defined, and in the spherical coordinate system the coordinates within Ω satisfy:

|θ - θ_s| ≤ θ_2,

where θ_2 and its elevation counterpart are the thresholds set in azimuth and elevation, respectively, and θ_s and its elevation counterpart are the azimuth and elevation of the coordinates of the first sound source, which has already been located. The P(q) values of the subgroups contained in region Ω are uniformly assigned a low power value E_l, and these subgroups are not considered in the next step. When the first sound source is being located, no power reduction or removal of nearby subgroups is performed; the reduction and removal are applied starting from the localization of the second sound source.

The second sound source is located according to the modified power distribution. Since it cannot be guaranteed that region Ω contains all the subgroups near the first sound source that may affect the localization of the second sound source, the screening method used for the first sound source is applied again. In the first-layer search, the subgroup with the largest power value among the subgroups not contained in region Ω is selected as the first candidate subgroup of the second sound source, and the remaining subgroups in the lookup table are then screened, which reduces the possibility of mixing in subgroups whose power is contributed mainly by the first sound source. Suppose that u_2 subgroups have already been selected as candidate subgroups for the second sound source; the conditions for determining whether the v_2-th of the remaining subgroups is a candidate subgroup are:

where the center coordinates of the v_2-th subgroup and the center coordinates of the b_2-th subgroup among the existing candidate subgroups are expressed in the Cartesian coordinate system; (X_c2, Y_c2, Z_c2) denotes the average coordinates obtained by averaging the center coordinates of all current candidate subgroups in the Cartesian coordinate system; θ_c2 and the corresponding elevation angle are the azimuth and elevation of those average coordinates; the azimuth and elevation of the center coordinates of the subgroup currently being judged are defined in the same way; and θ_1 and its elevation counterpart are the azimuth threshold and the elevation threshold, respectively.

Screening stops when the number of candidate subgroups u_2 reaches a given value n or all subgroups have been judged. The power P(q) of every primary coordinate in the candidate subgroups is then calculated and sorted, and the primary coordinate with the maximum power value is taken as the coordinates of the second sound source.

It should be noted that terms such as "upper", "lower", "left", "right", "front" and "rear" are used for descriptive purposes only and are not intended to limit the scope in which the invention may be practiced; changes or adjustments of their relative relationships, without material alteration of the technical content, are also regarded as within the scope in which the invention may be practiced.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims

1. A multi-sound source localization method suitable for a reverberant environment, comprising the steps of:

S2, collecting voice signals by using a microphone array;

repeating the above operation until all sound source positions are positioned;

the candidate subgroups of a given sound source are determined as follows: a search is performed on the calculated center-coordinate power values of the subgroups, and the subgroup with the maximum power value is selected as the first candidate subgroup; when the v-th of the remaining subgroups is judged, the conditions for selecting it as a candidate subgroup are:

|θ_v - θ_c| ≤ θ_1,

where (X_b, Y_b, Z_b) denotes the center coordinates of the b-th subgroup among the existing candidate subgroups in the Cartesian coordinate system; (X_cc, Y_cc, Z_cc) denotes the average coordinates obtained by averaging the center coordinates of all current candidate subgroups in the Cartesian coordinate system; θ_c and the corresponding elevation angle are the azimuth and elevation of those average coordinates; θ_v and the corresponding elevation angle are the azimuth and elevation of the center coordinates of the subgroup currently being judged; and θ_1 and its elevation counterpart are the azimuth threshold and the elevation threshold, respectively;

removing the coordinates near the sound source in the search area specifically comprises:

|θ - θ_s| ≤ θ_2,

2. A multi-sound source localization method for reverberant environments according to claim 1, wherein the grouping in step S1 is based on:

where I denotes the total number of coordinates in the whole search area, J denotes the current number of groups, and z_j denotes the set of all coordinates in the j-th group; J has an initial value of 1 and is increased by 1 at a time until the termination formula holds; as i and j change, p(q_i ∈ z_j), e(q_i, z_j) and the center coordinates of z_j are calculated by the K-means algorithm;

where e(q_i, z_j) denotes the group error, defined as the sum of the sound path differences over all microphone pairs, evaluated from the TDOA between microphone k and microphone l for position q_i and the TDOA between microphone k and microphone l for the center coordinates of group z_j; M denotes the number of microphones, and θ_t denotes the threshold.

3. A multi-sound source localization method for reverberant environments according to claim 2, wherein the threshold θ_t is defined as follows:

4. A multi-sound source localization method for reverberant environments according to claim 3, wherein the microphone array uses 16 microphones to collect the voice signals, the microphone array is cylindrical as a whole, and 8 microphones are uniformly distributed on each of the upper and lower contours.

5. A multi-sound source localization method for reverberant environments according to claim 2, wherein,

in the frequency domain, the formula for calculating the coordinate power is expressed as:

where G_l(ω) denotes the frequency-domain system function of the filter at the l-th microphone, X_l(ω) denotes the Fourier transform of the l-th microphone signal, G_k*(ω) denotes the conjugate of the frequency-domain system function of the filter at the k-th microphone, X_k*(ω) denotes the conjugate of the Fourier transform of the k-th microphone signal, and τ_{k,q} denotes the signal propagation time from coordinate position q to the k-th microphone;