CN113419216A - Multi-sound-source positioning method suitable for reverberation environment - Google Patents


Info

Publication number: CN113419216A (application CN202110684270.9A; granted as CN113419216B)
Authority: CN (China)
Inventors: Hu Qiucen (胡秋岑), Wu Lifu (吴礼福)
Applicant and current assignee: Nanjing University of Information Science and Technology
Legal status: Granted, Active (the legal status listed by Google Patents is an assumption, not a legal conclusion)

Classifications

    • G01S 5/18: Position-fixing by co-ordinating two or more direction or position line determinations, or two or more distance determinations, using ultrasonic, sonic, or infrasonic waves
    • G01S 5/20: Position of source determined by a plurality of spaced direction-finders
    • G06F 18/23213: Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Abstract

The invention provides a multi-sound-source positioning method suitable for reverberant environments. The method groups all coordinates of the whole search area and computes the center coordinate of each group; collects speech signals with a 16-microphone array; locates one sound source with the double-layer search space clustering (TL-SSC) multi-sound-source positioning algorithm and removes the coordinates near that source from the search area; and repeats this operation until all sound sources are located. The method solves real-time multi-sound-source positioning under reverberant conditions, requires fewer microphones than other multi-sound-source positioning methods, improves computational efficiency while keeping high positioning accuracy, and meets the real-time requirements of mobile-robot applications.

Description

Multi-sound-source positioning method suitable for reverberation environment
Technical Field
The invention relates to the technical field of sound source positioning, in particular to a multi-sound-source positioning method suitable for a reverberation environment.
Background
Multi-sound-source positioning is widely needed in real-time systems such as video conferencing, speech recognition, and mobile service robots, and has long been a research hotspot in acoustic signal processing. For example, when a mobile robot performs real-time intelligent services, a multi-sound-source positioning method determines the positions of speakers and guides the robot to complete the service. Existing sound-source positioning methods fall into three main categories: subspace-based methods, methods based on steerable beamforming, and methods based on time delay of arrival. Subspace-based methods receive signals at each microphone and exploit the orthogonality of the signal and noise subspaces to construct a spatial spectrum whose peaks give the source directions; they achieve high positioning accuracy but demand stationary source signals and perform poorly in small spaces. Steerable-beamforming methods are simple in principle and computationally light, but their noise robustness is poor, they require prior knowledge of the ambient noise, and real-time positioning is hard to guarantee. Time-delay-of-arrival methods determine the source position from the acoustic path differences between the source and each microphone; their computational complexity is generally lower than that of the other two categories, their positioning accuracy is high, and real-time operation is easy to satisfy.
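As a concrete illustration of the time-delay-of-arrival approach described above, the delay between two microphone signals is commonly estimated with the generalized cross-correlation with phase transform (GCC-PHAT). The sketch below is a minimal illustrative implementation, not code from the patent; the signal length and sampling rate are arbitrary example values.

```python
import numpy as np

def gcc_phat(x, y, fs):
    """Estimate the delay of x relative to y via generalized
    cross-correlation with PHAT weighting."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12       # PHAT: keep the phase, drop the magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs                    # delay in seconds

# Synthetic check: a unit pulse delayed by 5 samples at fs = 16 kHz
fs = 16000
ref = np.zeros(512)
ref[100] = 1.0
delayed = np.zeros(512)
delayed[105] = 1.0
tau = gcc_phat(delayed, ref, fs)
```

The PHAT weighting whitens the cross-spectrum so that the correlation peak depends only on phase, which is what makes TDOA estimation comparatively robust under reverberation.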
Disclosure of Invention
For applications with strict real-time requirements, such as indoor mobile robots, where there is room to improve computational efficiency while preserving accuracy as far as possible in a small space, the invention provides a multi-sound-source positioning method based on double-layer Search Space Clustering (TL-SSC). The method uses 16 microphones; improves computational efficiency through coordinate grouping, real-time double-layer search, clustering-based screening, and threshold judgment; and achieves real-time multi-sound-source positioning using Time Difference of Arrival (TDOA) estimation.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for multiple sound source localization for use in reverberant environments, comprising the steps of:
s1, collecting coordinates in the whole search area, grouping the coordinates, and calculating the center coordinate of each group;
s2, collecting voice signals by using a microphone array;
s3, determining a candidate group of a certain sound source position by adopting a double-layer search space clustering multi-sound source positioning algorithm in a mode of calculating the central coordinate power of each group, positioning the sound source position in all coordinates contained in the candidate group, and removing the coordinates near the sound source in a search area;
the above operation of step S3 is repeated until all the sound source positions are located.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, the basis of the grouping in step S1 is:
Figure BDA0003124044750000021
Figure BDA0003124044750000022
Figure BDA0003124044750000023
if the ith coordinate qiBelong to the jth group zjThen p (q)i∈zj) Has a value of 1; if the ith coordinate qiNot belonging to group j zj,p(qi∈zj) Is 0;
wherein I represents the total number of coordinates in the entire search area, J represents the number of current groups, and zjExpressed as all sets of coordinates in the jth group; wherein the initial value of J is 1, and sequentially adding 1 until the formula
Figure BDA0003124044750000024
It holds that as i and j change, p (q)i∈zj)、e(qi,zj) And zjThe central coordinates of the three-dimensional image are calculated through a K-mean algorithm;
in the formula, e (q)i,zj) Representing the bunching error, defined as the difference in acoustic path between all microphone pairs
Figure BDA0003124044750000025
Summing;
Figure BDA0003124044750000026
representing the distance position q between microphone k and microphone liThe value of the TDOA value of (a),
Figure BDA0003124044750000027
representing the set z of distances between microphone k and microphone ljTDOA value of center coordinates, M representing the number of microphones, θtDenoted as threshold.
Further, the threshold θ_t is defined as

θ_t = λ = c / f

where λ is the wavelength, c is the speed of sound, and f is the sampling rate; in sound source positioning, the value of θ_t is determined by the maximum frequency of the speech signal.
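The grouping step above can be sketched as K-means over per-coordinate TDOA feature vectors, growing the number of groups J until every coordinate's clustering error falls below the threshold. This is an illustrative reconstruction under stated assumptions (free-field delays, speed of sound c = 343 m/s, simple random K-means initialization); the function names are my own, not the patent's.

```python
import numpy as np

def tdoa_vector(q, mics, c=343.0):
    """Pairwise TDOAs (in seconds) of point q for all microphone pairs."""
    d = np.linalg.norm(mics - q, axis=1)          # distance to each microphone
    k, l = np.triu_indices(len(mics), k=1)
    return (d[k] - d[l]) / c

def group_coordinates(grid, mics, theta_t, c=343.0, iters=20, seed=0):
    """Grow the number of groups J until every coordinate's clustering
    error e(q_i, z_j) (sum of |TDOA differences| over microphone pairs,
    relative to its group's center in TDOA space) is below theta_t."""
    rng = np.random.default_rng(seed)
    feats = np.array([tdoa_vector(q, mics, c) for q in grid])
    for J in range(1, len(grid) + 1):
        centers = feats[rng.choice(len(grid), size=J, replace=False)]
        for _ in range(iters):                    # plain K-means on TDOA features
            err = np.abs(feats[:, None, :] - centers[None]).sum(axis=2)
            labels = err.argmin(axis=1)
            for j in range(J):
                if np.any(labels == j):
                    centers[j] = feats[labels == j].mean(axis=0)
        err = np.abs(feats[:, None, :] - centers[None]).sum(axis=2)
        if err.min(axis=1).max() <= theta_t:
            break
    return labels, J
```

Clustering in TDOA space rather than Cartesian space groups together coordinates that the array can barely distinguish, which is what makes evaluating only the group centers in the first-layer search safe.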
Furthermore, the microphone array that collects the speech signals uses 16 microphones arranged on a cylinder, with 8 microphones uniformly distributed on each of the upper and lower rims.
Further, in the time domain, the coordinate power is computed as

y(t, q) = Σ_{m=1}^{M} g_m(t) * x_m(t + τ_{m,q})

where y(t, q) is the output value of coordinate position q at time t, g_m(t) is the impulse response of the filter at the m-th microphone ("*" denotes convolution), x_m(t + τ_{m,q}) is the signal received by the m-th microphone at time t + τ_{m,q}, and τ_{m,q} is the signal propagation time from coordinate position q to the m-th microphone. In the frequency domain, the coordinate power formula becomes

Y(ω, q) = Σ_{m=1}^{M} G_m(ω) X_m(ω) e^{jωτ_{m,q}}

where Y(ω, q) is the output value of coordinate position q at frequency ω, X_m(ω) is the Fourier transform of the m-th microphone signal, and G_m(ω) is the frequency-domain system function of the filter at the m-th microphone.

Based on the frequency-domain formula, the power output value P(q) of coordinate position q is

P(q) = Σ_{l=1}^{M} Σ_{k=1}^{M} ∫ G_l(ω) G_k*(ω) X_l(ω) X_k*(ω) e^{jω(τ_{l,q} − τ_{k,q})} dω

where G_l(ω) is the frequency-domain system function of the filter at the l-th microphone, X_l(ω) is the Fourier transform of the l-th microphone signal, G_k*(ω) is the conjugate of the frequency-domain system function of the filter at the k-th microphone, X_k*(ω) is the conjugate of the Fourier transform of the k-th microphone signal, and τ_{k,q} is the signal propagation time from coordinate position q to the k-th microphone. In this formula,

Ψ_{lk}(ω) = G_l(ω) G_k*(ω) = 1 / |X_l(ω) X_k*(ω)|

is the PHAT weighting coefficient between the l-th and k-th microphone signals.

After the power of the center coordinate of each group is computed, the candidate groups are determined, and a sound source is located among all coordinates contained in the candidate groups: the source position is the coordinate corresponding to the maximum power value, namely

q_s = argmax_q P(q).
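The frequency-domain power computation above amounts to steered-response power with PHAT weighting (SRP-PHAT). Below is a minimal sketch of the power evaluation for a single candidate coordinate, assuming free-field propagation delays; the function name and parameters are illustrative, not from the patent.

```python
import numpy as np

def srp_phat_power(signals, mics, q, fs, c=343.0):
    """P(q): sum over microphone pairs of the PHAT-weighted cross-spectrum,
    phase-aligned with the free-field delays of candidate position q."""
    M, n = signals.shape
    X = np.fft.rfft(signals, axis=1)
    omega = 2 * np.pi * np.fft.rfftfreq(n, d=1 / fs)
    tau = np.linalg.norm(mics - q, axis=1) / c   # propagation time to each mic
    power = 0.0
    for k in range(M):
        for l in range(k + 1, M):
            cross = X[l] * np.conj(X[k])
            cross /= np.abs(cross) + 1e-12       # PHAT weighting
            power += np.real(np.sum(cross * np.exp(1j * omega * (tau[l] - tau[k]))))
    return power
```

Summing only over pairs with k < l and taking the real part matches the full double sum up to a constant, since the (k, k) terms do not depend on q and the (l, k) and (k, l) terms are complex conjugates.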
further, the specific way of determining a certain sound source candidate group is as follows: searching according to the result of the power value calculation of the central coordinate of each group, selecting the group corresponding to the maximum power value as a first candidate group, and when judging the vth group in the rest groups, selecting the group as the candidate group under the condition that:
Figure BDA0003124044750000039
Figure BDA00031240447500000310
Figure BDA00031240447500000311
vc|≤θ1
Figure BDA00031240447500000312
stopping judging the candidate group when the number u of the candidate group reaches a certain number or all the groups are judged;
in the formula (X)b,Yb,Zb) Expressed as the centre coordinates of the b-th of the existing candidate groups in a Cartesian coordinate system, (X)cc,Ycc,Zcc) Expressed as the mean coordinate, theta, of the cartesian coordinate system after averaging the central coordinates of all the current candidate groupscTo average the azimuth of the mean coordinate after averaging the center coordinates of all the current candidate groups,
Figure BDA0003124044750000041
elevation angle theta of average coordinate after averaging center coordinates of all current candidate groupsvExpressed as the position of the center coordinate of the current group to be discriminatedThe azimuth angle of (a) is,
Figure BDA0003124044750000049
elevation angle, theta, expressed as the central coordinate of the current group to be discriminated1Indicated as an azimuth angle threshold value, is,
Figure BDA0003124044750000042
denoted as the elevation threshold.
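The candidate-group test can be sketched as follows; the function name and the threshold values used in any call are illustrative, not from the patent.

```python
import numpy as np

def angles(p):
    """Azimuth and elevation of a Cartesian point."""
    return np.arctan2(p[1], p[0]), np.arctan2(p[2], np.hypot(p[0], p[1]))

def is_candidate(center_v, candidate_centers, theta_1, phi_1):
    """Group v joins the candidate set when its center's azimuth and
    elevation both lie within the thresholds of the mean direction of
    the candidate-group centers already selected."""
    az_c, el_c = angles(np.mean(candidate_centers, axis=0))
    az_v, el_v = angles(np.asarray(center_v, dtype=float))
    return abs(az_v - az_c) <= theta_1 and abs(el_v - el_c) <= phi_1
```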
Further, removing the coordinates near a located sound source from the search area works as follows: a region Ω is defined; for the groups contained in Ω, the power of every coordinate position in the group is uniformly reduced to a given low power value E_l, and the groups contained in Ω are no longer considered in the subsequent steps that locate the remaining sound sources. A coordinate (θ, φ, r) in the spherical coordinate system belongs to the region Ω if

|θ − θ_s| ≤ θ_2
|φ − φ_s| ≤ φ_2

where θ is the azimuth of the coordinate, φ is its elevation, and r is its distance from the origin of the coordinate system; θ_s and φ_s are the azimuth and elevation of the most recently located sound source; and θ_2 and φ_2 are the azimuth and elevation thresholds.
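A sketch of the removal step, assuming the power values are stored per search coordinate; the function name and the uniform low value are illustrative assumptions.

```python
import numpy as np

def suppress_region(coords, powers, source, theta_2, phi_2, low_power=0.0):
    """Set the power of every coordinate whose azimuth/elevation lie
    within theta_2/phi_2 of the located source to a uniform low value,
    and report which coordinates were suppressed."""
    az = np.arctan2(coords[:, 1], coords[:, 0])
    el = np.arctan2(coords[:, 2], np.hypot(coords[:, 0], coords[:, 1]))
    az_s = np.arctan2(source[1], source[0])
    el_s = np.arctan2(source[2], np.hypot(source[0], source[1]))
    inside = (np.abs(az - az_s) <= theta_2) & (np.abs(el - el_s) <= phi_2)
    out = powers.copy()
    out[inside] = low_power
    return out, inside
```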
The invention has the following beneficial effects:
1. Through clustering-based screening, threshold judgment, and related steps, the method selects a suitable removal region and candidate-group screening conditions, so that the TL-SSC algorithm can be applied to multi-sound-source systems.
2. Compared with other multi-sound-source positioning methods, the method requires fewer microphones, improves computational efficiency while keeping high positioning accuracy, and meets the real-time requirements of mobile-robot applications.
Drawings
FIG. 1 is a schematic diagram of the microphone and sound-source distribution according to the present invention; in the figure, the five-pointed stars represent the microphones and the dots represent the sound sources.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
The positioning of two of multiple sound sources is taken as an example for explanation.
The primary coordinates in the whole search area are grouped and the center coordinate of each group is computed. The grouping minimizes the total clustering error

E = Σ_{i=1}^{I} Σ_{j=1}^{J} p(q_i ∈ z_j) · e(q_i, z_j)

subject to Σ_{j=1}^{J} p(q_i ∈ z_j) = 1 for every coordinate, with

e(q_i, z_j) = Σ_{k=1}^{M−1} Σ_{l=k+1}^{M} |τ_{kl}(q_i) − τ_{kl}(z_j)|

where I is the total number of primary coordinates in the whole search area, J is the current number of groups, M is the number of microphones, and θ_t is a threshold. z_j denotes the set of all coordinates in the j-th group; if the i-th coordinate q_i belongs to the j-th group z_j, then p(q_i ∈ z_j) is 1, otherwise it is 0. e(q_i, z_j) is the clustering error, defined as the sum over all microphone pairs of |τ_{kl}(q_i) − τ_{kl}(z_j)|, where τ_{kl}(q_i) is the TDOA value between microphone k and microphone l for position q_i, and τ_{kl}(z_j) is the corresponding TDOA value for the center coordinate of group z_j. The initial value of J is 1, and J is incremented by 1 each time until

e(q_i, z_j) ≤ θ_t

holds for every coordinate and its assigned group. As J increases and i and j change, p(q_i ∈ z_j), e(q_i, z_j), and the center coordinates of the groups z_j are recomputed by the K-means algorithm. The threshold θ_t is defined as

θ_t = λ = c / f

where λ is the wavelength, c is the speed of sound, and f is the sampling rate; in sound source positioning, the value of θ_t is determined by the maximum frequency of the speech signal.
A candidate-group set of one sound source (here, the first source to be located) is screened out and that source is located. From the grouping result, the power value of each group's center coordinate is computed and the first-layer search is performed, selecting the group whose center coordinate has the maximum power value as the first candidate group. Suppose u_1 groups have already been screened as candidates in the first-layer search; to continue selecting groups that belong to the same sound source as the existing candidates, the v_1-th remaining group is selected as a candidate if

|θ_v1 − θ_c1| ≤ θ_1
|φ_v1 − φ_c1| ≤ φ_1

until u_1 reaches a preset number n or all groups have been judged. Here (X_b1, Y_b1, Z_b1) denotes the Cartesian center coordinate of the b_1-th existing candidate group, (X_c1, Y_c1, Z_c1) is the mean of the center coordinates of all current candidate groups, θ_c1 and φ_c1 are the azimuth and elevation of that mean coordinate, θ_v1 and φ_v1 are the azimuth and elevation of the center coordinate of the group currently being judged, and θ_1 and φ_1 are the azimuth and elevation thresholds. Using the formula for P(q), the power of every primary coordinate in all candidate groups is then computed; this second-layer search finds the position with the maximum output power among those coordinates, which is taken as the position of the first sound source.
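The two search layers described above can be sketched as follows. For brevity this sketch keeps the n strongest groups instead of applying the azimuth/elevation screening rule; `power_fn` stands in for the P(q) evaluation and is supplied by the caller, so the names and structure here are illustrative rather than the patent's exact procedure.

```python
import numpy as np

def two_layer_search(group_centers, group_members, power_fn, n_candidates=3):
    """Layer 1: evaluate power only at the group center coordinates and
    keep the strongest groups as candidates. Layer 2: evaluate every
    coordinate inside the candidate groups and return the argmax."""
    center_power = np.array([power_fn(c) for c in group_centers])
    order = np.argsort(center_power)[::-1][:n_candidates]   # strongest first
    best_q, best_p = None, -np.inf
    for j in order:
        for q in group_members[j]:
            p = power_fn(q)
            if p > best_p:
                best_q, best_p = q, p
    return best_q, best_p
```

The efficiency gain comes from the first layer: the expensive power function is evaluated at J group centers plus the coordinates of a few candidate groups, instead of at every coordinate of the search area.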
The groups near the located source are then suppressed by assigning them a low power value, which reduces the influence of the coordinates surrounding that source on the positioning of the next source. A region Ω is defined; a coordinate (θ, φ, r) in the spherical coordinate system lies inside Ω if

|θ − θ_s| ≤ θ_2
|φ − φ_s| ≤ φ_2

where θ_2 and φ_2 are thresholds set on azimuth and elevation, and θ_s and φ_s are the azimuth and elevation of the coordinate of the located first sound source. The P(q) values of the groups contained in Ω are uniformly assigned the low power value E_l, and the groups contained in Ω are no longer considered in the following steps. Note that when locating the first sound source there is no previously located source, so no nearby-group power reduction is applied; the reduction and removal of coordinate power near a located source is performed from the second located sound source onward.
The second sound source is located using the modified power distribution. Because the region Ω cannot be guaranteed to contain every group near the first source that might affect the positioning of the second, the same screening method is used again: in the first-layer search, the group with the maximum power value among the groups not contained in Ω is selected as the first candidate group of the second source, and the remaining groups in the lookup table are then screened, which reduces the chance of including groups whose main power contribution comes from the first source. Suppose u_2 groups have been screened as candidates for the second source; the v_2-th remaining group is judged a candidate if

|θ_v2 − θ_c2| ≤ θ_1
|φ_v2 − φ_c2| ≤ φ_1

where (X_v2, Y_v2, Z_v2) denotes the center coordinate of the v_2-th group, (X_b2, Y_b2, Z_b2) denotes the Cartesian center coordinate of the b_2-th existing candidate group, (X_c2, Y_c2, Z_c2) is the mean of the center coordinates of all current candidate groups, θ_c2 and φ_c2 are the azimuth and elevation of that mean coordinate, θ_v2 and φ_v2 are the azimuth and elevation of the center coordinate of the group currently being judged, and θ_1 and φ_1 are the azimuth and elevation thresholds. The screening stops when the number of candidate groups u_2 reaches the preset value n or all groups have been judged. The P(q) values of all primary coordinates in the candidate groups are then computed and sorted, and the primary coordinate with the maximum power value is selected as the coordinate of the second sound source.
It should be noted that terms such as "upper", "lower", "left", "right", "front", and "back" are used herein only for clarity of description; they do not limit the scope of the invention, and changes to these relative relationships that do not substantively alter the technical content also fall within the scope of the invention.
The above is only a preferred embodiment of the present invention; the scope of protection is not limited to this embodiment, and all technical solutions within the idea of the invention fall within its scope. It should be noted that modifications and refinements that a person skilled in the art could make without departing from the principle of the invention are also within its scope.

Claims (7)

1. A method for locating multiple sound sources suitable for use in a reverberant environment, comprising the steps of:
s1, collecting coordinates in the whole search area, grouping the coordinates, and calculating the center coordinate of each group;
s2, collecting voice signals by using a microphone array;
s3, determining a candidate group of a certain sound source position by adopting a double-layer search space clustering multi-sound source positioning algorithm in a mode of calculating the central coordinate power of each group, positioning the sound source position in all coordinates contained in the candidate group, and removing the coordinates near the sound source in a search area;
and repeating the operations until all the sound source positions are positioned.
2. The method of claim 1, wherein the grouping in step S1 minimizes the total clustering error

E = Σ_{i=1}^{I} Σ_{j=1}^{J} p(q_i ∈ z_j) · e(q_i, z_j)

subject to Σ_{j=1}^{J} p(q_i ∈ z_j) = 1 for every coordinate, with

e(q_i, z_j) = Σ_{k=1}^{M−1} Σ_{l=k+1}^{M} |τ_{kl}(q_i) − τ_{kl}(z_j)|

if the i-th coordinate q_i belongs to the j-th group z_j, then p(q_i ∈ z_j) has the value 1; if q_i does not belong to z_j, p(q_i ∈ z_j) is 0;

wherein I is the total number of coordinates in the whole search area, J is the current number of groups, and z_j denotes the set of all coordinates in the j-th group; the initial value of J is 1, and J is incremented by 1 until

e(q_i, z_j) ≤ θ_t

holds for every coordinate and its assigned group; as i and j change, p(q_i ∈ z_j), e(q_i, z_j), and the center coordinates of the groups z_j are computed with the K-means algorithm;

e(q_i, z_j) is the clustering error, defined as the sum over all microphone pairs of |τ_{kl}(q_i) − τ_{kl}(z_j)|, where τ_{kl}(q_i) is the TDOA value between microphone k and microphone l for position q_i, τ_{kl}(z_j) is the TDOA value between microphone k and microphone l for the center coordinate of group z_j, M is the number of microphones, and θ_t is a threshold.
3. The method of claim 2, wherein the threshold θ_t is defined as

θ_t = λ = c / f

where λ is the wavelength, c is the speed of sound, and f is the sampling rate; in sound source positioning, the value of θ_t is determined by the maximum frequency of the speech signal.
4. The method of claim 3, wherein the microphone array that collects the speech signals uses 16 microphones arranged on a cylinder, with 8 microphones uniformly distributed on each of the upper and lower rims.
5. The method of claim 2, wherein in the frequency domain the coordinate power is computed as

Y(ω, q) = Σ_{m=1}^{M} G_m(ω) X_m(ω) e^{jωτ_{m,q}}

where Y(ω, q) is the output value of coordinate position q at frequency ω, X_m(ω) is the Fourier transform of the m-th microphone signal, G_m(ω) is the frequency-domain system function of the filter at the m-th microphone, and τ_{m,q} is the signal propagation time from coordinate position q to the m-th microphone;

based on the frequency-domain formula, the power output value P(q) of coordinate position q is

P(q) = Σ_{l=1}^{M} Σ_{k=1}^{M} ∫ G_l(ω) G_k*(ω) X_l(ω) X_k*(ω) e^{jω(τ_{l,q} − τ_{k,q})} dω

where G_l(ω) is the frequency-domain system function of the filter at the l-th microphone, X_l(ω) is the Fourier transform of the l-th microphone signal, G_k*(ω) is the conjugate of the frequency-domain system function of the filter at the k-th microphone, X_k*(ω) is the conjugate of the Fourier transform of the k-th microphone signal, and τ_{k,q} is the signal propagation time from coordinate position q to the k-th microphone; in this formula,

Ψ_{lk}(ω) = G_l(ω) G_k*(ω) = 1 / |X_l(ω) X_k*(ω)|

is the PHAT weighting coefficient between the l-th and k-th microphone signals;

after the power of the center coordinate of each group is computed, the candidate groups are determined, and a sound source is located among all coordinates contained in the candidate groups as the coordinate corresponding to the maximum power value, namely

q_s = argmax_q P(q).
6. The method of claim 5, wherein the candidate groups of a sound source are determined as follows: the groups are searched according to the computed power values of their center coordinates, and the group with the maximum power value is selected as the first candidate group; when judging the v-th group among the remaining groups, the group is selected as a candidate if

|θ_v − θ_c| ≤ θ_1
|φ_v − φ_c| ≤ φ_1

and the judging of candidate groups stops when their number u reaches a preset value or all groups have been judged;

wherein (X_b, Y_b, Z_b) denotes the Cartesian center coordinate of the b-th existing candidate group, (X_cc, Y_cc, Z_cc) = (1/u) Σ_{b=1}^{u} (X_b, Y_b, Z_b) is the mean of the center coordinates of all current candidate groups, θ_c and φ_c are the azimuth and elevation of that mean coordinate, θ_v and φ_v are the azimuth and elevation of the center coordinate of the group currently being judged, and θ_1 and φ_1 are the azimuth and elevation thresholds.
7. The method as claimed in claim 1, wherein the specific content of removing the coordinates near the sound source in the search area is:
a region omega is provided, and for the sub-groups contained in the region omega, the power of the coordinate position in the sub-group is uniformly reduced and a power value E is givenlMeanwhile, the small groups contained in the region omega are not considered in the subsequent step of positioning the positions of other sound sources;
in the spherical coordinate system, the coordinates (θ, φ, r) within the region Ω are required to satisfy:
|θ-θs|≤θ2
|φ-φs|≤φ2
where θ is expressed as the azimuth angle of the coordinate, φ is expressed as the elevation angle of the coordinate, r represents the distance of the coordinate from the origin of the coordinate system, θs is expressed as the azimuth angle of the coordinate position of the most recently located sound source, φs is expressed as the elevation angle of the coordinate position of the most recently located sound source, θ2 is expressed as an azimuth angle threshold, and φ2 is expressed as an elevation angle threshold.
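The region suppression described by this claim can be sketched as follows. The function name, the flat NumPy arrays standing in for the steered-response power map, and passing El as an argument are all assumptions for illustration; the condition itself (|θ-θs|≤θ2 and |φ-φs|≤φ2) is taken directly from the claim.

```python
import numpy as np

def suppress_region(points, powers, theta_s, phi_s, theta_2, phi_2, E_l):
    """Overwrite the power of every spherical-grid point (theta, phi, r) that
    lies inside the region Omega around the most recently located source,
    i.e. |theta - theta_s| <= theta_2 and |phi - phi_s| <= phi_2, and return
    a mask so those points can be skipped when locating the next source."""
    pts = np.asarray(points, dtype=float)          # columns: theta, phi, r
    in_omega = (np.abs(pts[:, 0] - theta_s) <= theta_2) & \
               (np.abs(pts[:, 1] - phi_s) <= phi_2)
    new_powers = np.asarray(powers, dtype=float).copy()
    new_powers[in_omega] = E_l                     # uniform reduced power value
    return new_powers, in_omega

# Two grid points; only the first lies inside Omega around (theta_s, phi_s) = (0.1, 0.2).
grid = [(0.10, 0.20, 1.0), (1.20, 0.20, 1.0)]
p, mask = suppress_region(grid, [5.0, 4.0], theta_s=0.1, phi_s=0.2,
                          theta_2=0.2, phi_2=0.2, E_l=0.0)
print(p.tolist(), mask.tolist())  # prints [0.0, 4.0] [True, False]
```

Returning the mask alongside the reduced powers mirrors the claim's second requirement: the suppressed groups are excluded from the subsequent search for the remaining sources, not merely attenuated.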
CN202110684270.9A 2021-06-21 2021-06-21 Multi-sound source positioning method suitable for reverberant environment Active CN113419216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110684270.9A CN113419216B (en) 2021-06-21 2021-06-21 Multi-sound source positioning method suitable for reverberant environment

Publications (2)

Publication Number Publication Date
CN113419216A true CN113419216A (en) 2021-09-21
CN113419216B CN113419216B (en) 2023-10-31

Family

ID=77789393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110684270.9A Active CN113419216B (en) 2021-06-21 2021-06-21 Multi-sound source positioning method suitable for reverberant environment

Country Status (1)

Country Link
CN (1) CN113419216B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106093864A (en) * 2016-06-03 2016-11-09 清华大学 A kind of microphone array sound source space real-time location method
US9554208B1 (en) * 2014-03-28 2017-01-24 Marvell International Ltd. Concurrent sound source localization of multiple speakers
CN106940439A (en) * 2017-03-01 2017-07-11 西安电子科技大学 K mean cluster weighting sound localization method based on wireless acoustic sensor network
CN108198568A (en) * 2017-12-26 2018-06-22 太原理工大学 A kind of method and system of more auditory localizations
CN110443371A (en) * 2019-06-25 2019-11-12 深圳欧克曼技术有限公司 A kind of artificial intelligence device and method
CN111352075A (en) * 2018-12-20 2020-06-30 中国科学院声学研究所 Underwater multi-sound-source positioning method and system based on deep learning
CN111474521A (en) * 2020-04-09 2020-07-31 南京理工大学 Sound source positioning method based on microphone array in multipath environment
CN111489753A (en) * 2020-06-24 2020-08-04 深圳市友杰智新科技有限公司 Anti-noise sound source positioning method and device and computer equipment

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
"MULTIPLE SOUND SOURCE LOCALIZATION BASED ON TDOA CLUSTERING AND MULTI-PATH MATCHING PURSUIT", 《IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP)》 *
"Multi-source sound localization using the competitive k-means clustering", 《2010 IEEE 15TH CONFERENCE ON EMERGING TECHNOLOGIES & FACTORY AUTOMATION (ETFA 2010)》 *
YOOK D: "Fast sound source localization using two-level search space clustering", 《IEEE TRANSACTIONS ON CYBERNETICS》, vol. 46, no. 1, pages 1 - 5 *
倪志莲;蔡卫平;张怡典;: "基于子带可控响应功率的多声源定位方法", 计算机工程与应用, no. 24 *
庄启雷;黄青华;: "基于三线交点球麦克风阵列的远场多声源定位", 上海大学学报(自然科学版), no. 02 *
滕鹏晓;杨亦春;李晓东;田静;: "基于多阵列数据融合的宽带多声源定位研究", 应用声学, no. 03 *
赵小燕;汤捷;周琳;吴镇扬;: "基于相位差复指数变换的传声器多声源定位", 东南大学学报(自然科学版), no. 02 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662383A (en) * 2022-12-22 2023-01-31 杭州爱华智能科技有限公司 Method and system for deleting main sound source, method, system and device for identifying multiple sound sources
CN117828405A (en) * 2024-02-23 2024-04-05 兰州交通大学 Signal positioning method based on intelligent frequency spectrum sensing
CN117828405B (en) * 2024-02-23 2024-05-07 兰州交通大学 Signal positioning method based on intelligent frequency spectrum sensing

Similar Documents

Publication Publication Date Title
CN107102296B (en) Sound source positioning system based on distributed microphone array
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
CN113419216A (en) Multi-sound-source positioning method suitable for reverberation environment
CN110068795A (en) A kind of indoor microphone array sound localization method based on convolutional neural networks
CN104142492A (en) SRP-PHAT multi-source spatial positioning method
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN111429939B (en) Sound signal separation method of double sound sources and pickup
CN107167770A (en) A kind of microphone array sound source locating device under the conditions of reverberation
Di Carlo et al. Mirage: 2d source localization using microphone pair augmentation with echoes
CN109212481A (en) A method of auditory localization is carried out using microphone array
Alexandridis et al. Multiple sound source location estimation and counting in a wireless acoustic sensor network
CN109884591A (en) A kind of multi-rotor unmanned aerial vehicle acoustical signal Enhancement Method based on microphone array
CN105607042A (en) Method for locating sound source through microphone array time delay estimation
CN110610718A (en) Method and device for extracting expected sound source voice signal
US20130148814A1 (en) Audio acquisition systems and methods
EP2362238B1 (en) Estimating the distance from a sensor to a sound source
Rascon et al. Lightweight multi-DOA tracking of mobile speech sources
CN112363112B (en) Sound source positioning method and device based on linear microphone array
KR20090128221A (en) Method for sound source localization and system thereof
CN113514801A (en) Microphone array sound source positioning method and sound source identification method based on deep learning
Himawan et al. Clustering of ad-hoc microphone arrays for robust blind beamforming
Brutti et al. Speaker localization based on oriented global coherence field
CN110927668A (en) Sound source positioning optimization method of cube microphone array based on particle swarm
CN116008913A (en) Unmanned aerial vehicle detection positioning method based on STM32 and small microphone array
CN110441779B (en) Multi-sonobuoy distributed co-location method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant