CN111239691B - Multi-sound-source tracking method for restraining main sound source - Google Patents

Multi-sound-source tracking method for restraining main sound source

Info

Publication number: CN111239691B
Application number: CN202010184264.2A
Authority: CN (China)
Legal status: Active (granted; the legal status is an assumption, not a legal conclusion)
Original language: Chinese (zh)
Other versions: CN111239691A
Inventors: 蔡卫平, 黄印君, 刘瑞娟
Current and original assignee: Jiujiang Vocational and Technical College

Classifications

    • G — PHYSICS
    • G01 — MEASURING; TESTING
    • G01S — RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 — Position-fixing by co-ordinating two or more direction or position line determinations; position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 — Position-fixing by co-ordinating two or more direction or position line determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/22 — Position of source determined by co-ordinating a plurality of position lines defined by path-difference measurements

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A multi-sound-source tracking method that suppresses the primary sound source comprises: framing and windowing the speech signals received by a microphone array; generating an initial particle group from the initial state of each sound source; predicting new particle states from the state equation and computing an observation value for each particle state; identifying the primary source from the localization-function value at each sound source's position estimate at the previous time step; computing the distance between each particle of the weaker sound source and the primary source and constructing an attenuation coefficient from that distance; multiplying the state observations of weaker-source particles near the primary source by the attenuation coefficient to reduce their values; constructing a pseudo-likelihood function for each sound source from the localization function and the attenuation coefficient, and computing each sound source's particle weights at the current time step from the pseudo-likelihood function; and normalizing the particle weights and estimating each sound source's current position from the particle weights and particle states. The invention keeps tracking two sound sources even when they are close to each other or their trajectories cross, and can be widely applied in fields such as robot audition and audio surveillance.

Description

Multi-sound-source tracking method for restraining main sound source
Technical Field
The invention relates to the technical field of microphone-array-based multi-sound-source tracking, and in particular to a multi-sound-source tracking method that suppresses the primary sound source.
Background
Microphone-array-based speech source localization and tracking is widely used in digital hearing aids, robot audition, intelligent surveillance, and related fields. Early localization and tracking techniques targeted a single sound source, for applications such as video conferencing and vehicle-mounted hands-free voice communication, and after years of research single-source algorithms can achieve high accuracy. In recent years, as the application domain of sound source tracking has expanded, localization and tracking of multiple sound sources must be considered in many scenarios.
Current microphone-array sound source tracking algorithms fall mainly into two classes: one represented by Kalman filtering and its refinements, the other by particle filtering. The former can only be used when several assumptions are satisfied; the latter has a wider range of application and higher accuracy, and now dominates tracking research. Single-source tracking in indoor environments has been studied extensively, while multi-source tracking has received less attention. Tracking multiple sound sources in a reverberant environment is much harder than the single-source case: besides the effects of reverberation and noise on tracking accuracy, interference between the sources seriously degrades the algorithm, and when the source trajectories come close or cross, conventional tracking algorithms have difficulty distinguishing them and lose targets.
During particle filtering, each particle group moves with its target trajectory. When two targets come close, some particles of the weaker sound source fall into the region of the stronger source's spatial-spectrum peak and are given excessive weight, so the position estimate of the weaker source is biased toward the stronger source. As the iteration proceeds, the particle distribution of the weaker source becomes increasingly similar to that of the stronger source, and the estimated trajectory of the former nearly coincides with that of the latter. Even when the distance between the sources increases again, a conventional tracking algorithm can no longer track the weaker source, and the target is lost.
Disclosure of Invention
To overcome these shortcomings of the prior art, the invention provides a particle-filter-based tracking algorithm for multiple speech sources. The algorithm keeps track of two sound sources even when their trajectories cross or come close.
To this end, the invention provides a multi-sound-source tracking method that suppresses the primary sound source; the stronger of the sources is called the primary source. During tracking, the weights of weaker-source particles close to the primary source are multiplied by a suitable attenuation coefficient so that they do not become too large; this reduces the position-estimation error and prevents the estimated trajectories from merging. The method comprises the following steps:
s1, establishing a two-dimensional rectangular coordinate system, determining the coordinates of each array element in the microphone array, framing and windowing sound source signals received by the microphone array, and storing the sound source signals into a buffer area;
s2, generating an initial particle group of each sound source according to the initial state of each sound source:
In the rectangular coordinate system, the state vector of the $i$-th sound source at frame $t$ is $\mathbf{s}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}, \dot{x}_t^{(i)}, \dot{y}_t^{(i)}]^T$, whose first two elements are the sound source coordinates and whose last two are its velocity. For a state vector $\mathbf{s}_t^{(i)}$, the corresponding source coordinate vector is denoted $\mathbf{r}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}]^T$. With $N$ the number of particles and $N_s$ the number of sound sources, the initial particle group is

$$\{\mathbf{s}_0^{(i,j)}, w_0^{(i,j)}\}_{j=1}^{N}, \quad i = 1{:}N_s,$$

where $\mathbf{s}_0^{(i,j)} = \mathbf{s}_0^{(i)}$ is the initial state of the sound source and $w_0^{(i,j)} = 1/N$ is the initial particle weight.
S3, predicting the new particle state of each sound source according to the state equation:
The state equation can be written $\mathbf{s}_t^{(i,j)} = f(\mathbf{s}_{t-1}^{(i,j)}, \mathbf{u}_t)$. Specifically, the Langevin equation can be used as the state equation of the sound source:

$$\dot{\mathbf{r}}_t^{(i,j)} = a\,\dot{\mathbf{r}}_{t-1}^{(i,j)} + b\,\mathbf{u}_t, \qquad \mathbf{r}_t^{(i,j)} = \mathbf{r}_{t-1}^{(i,j)} + T\,\dot{\mathbf{r}}_t^{(i,j)},$$

where $\mathbf{u}_t$ is a two-dimensional Gaussian random vector with mean $[0, 0]^T$ and second-order identity covariance matrix, $T$ is the duration of one signal frame (i.e. the state-update interval), $a = e^{-\beta T}$, and $b = \bar{v}\sqrt{1 - a^2}$, with $\beta$ and $\bar{v}$ constants set according to the motion state of the sound source.
S4, calculating the observation value of each sound source particle state at the current time by using the positioning function according to the received signal frame:
Denote the received signal frame of the microphone array $\mathbf{x}_t = [x_1(n), x_2(n), \ldots, x_M(n)]^T$, where $M$ is the number of array elements. The steered response power with phase transform (SRP-PHAT) function is used as the localization function:

$$P_t(\mathbf{r}) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} R_{lm}\big(\tau_{lm}(\mathbf{r})\big),$$

where $\mathbf{r}$ is a hypothetical source coordinate, $l$ and $m$ are array-element indices, and $R_{lm}(\tau)$ is the PHAT-weighted generalized cross-correlation function of an element pair, defined as

$$R_{lm}(\tau) = \frac{1}{K} \sum_{k=0}^{K-1} \frac{X_l(k) X_m^*(k)}{|X_l(k) X_m^*(k)|}\, e^{j\omega\tau},$$

where $X_m(k)$ is the $K$-point discrete Fourier transform of the $m$-th element's received signal $x_m(n)$, $(\cdot)^*$ denotes complex conjugation, and $\omega$ is the analog angular frequency. $\tau_{lm}(\mathbf{r})$ is the time difference of arrival (TDOA) of the hypothetical source between an element pair,

$$\tau_{lm}(\mathbf{r}) = \frac{\|\mathbf{r} - \mathbf{r}_l\| - \|\mathbf{r} - \mathbf{r}_m\|}{c},$$

where $\mathbf{r}_m$ is the coordinate of the $m$-th element, $c$ is the speed of sound (342 m/s), and $\|\cdot\|$ denotes the vector 2-norm. The particle-state observation computed from the localization function is

$$z_t^{(i,j)} = P_t\big(\mathbf{r}_t^{(i,j)}\big).$$
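The SRP-PHAT evaluation of S4 can be sketched as follows. This is a simplified one-sided-FFT sketch that scores a set of candidate points; all names are illustrative, and the usage below checks only that a zero-TDOA candidate outscores an off-axis one for identical signals on two microphones.

```python
import numpy as np

def srp_phat(frames, mic_xy, points, fs=8000, c=342.0):
    """SRP-PHAT localization function of S4, evaluated at candidate points.
    frames: (M, L) windowed time-domain frame per microphone.
    mic_xy: (M, 2) element coordinates; points: (P, 2) hypothetical sources."""
    M, L = frames.shape
    X = np.fft.rfft(frames, axis=1)               # one-sided spectra
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)        # bin frequencies in Hz
    scores = np.zeros(len(points))
    for l in range(M - 1):
        for m in range(l + 1, M):
            cross = X[l] * np.conj(X[m])
            phat = cross / np.maximum(np.abs(cross), 1e-12)  # PHAT weighting
            d_l = np.linalg.norm(points - mic_xy[l], axis=1)
            d_m = np.linalg.norm(points - mic_xy[m], axis=1)
            tau = (d_l - d_m) / c                 # TDOA of each candidate point
            steer = np.exp(2j * np.pi * freqs[None, :] * tau[:, None])
            scores += (steer @ phat).real / L     # accumulate R_lm(tau_lm(r))
    return scores

rng = np.random.default_rng(1)
sig = rng.standard_normal(512)
frames = np.stack([sig, sig])                     # identical signals: true TDOA 0
mic_xy = np.array([[0.0, 0.0], [1.0, 0.0]])
pts = np.array([[0.5, 1.0], [0.0, 0.0]])          # equidistant point vs. off-axis point
scores = srp_phat(frames, mic_xy, pts)
```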
S5, constructing an attenuation coefficient from the distance between each particle of the weaker sound source and the primary source, and multiplying it by the particle-state observation to obtain the attenuated observation $\tilde{z}_t^{(i,j)}$;
S6, constructing a pseudo-likelihood function of each sound source according to the positioning function and the attenuation coefficient:
The pseudo-likelihood function is denoted $\tilde{p}(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)})$ and given by

$$\tilde{p}\big(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)}\big) = \max\big(\tilde{z}_t^{(i,j)},\ 0\big)^r,$$

where $\max(\cdot)$ ensures that the likelihood function is non-negative, and $r$ is a positive real number that adjusts the shape of the likelihood function to improve the performance of the tracking algorithm;
s7, calculating the weight of each sound source particle at the current moment according to the pseudo-likelihood function:
$$w_t^{(i,j)} = w_{t-1}^{(i,j)}\,\tilde{p}\big(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)}\big)$$
s8, normalizing the particle weight of each sound source:
$$\bar{w}_t^{(i,j)} = \frac{w_t^{(i,j)}}{\sum_{j=1}^{N} w_t^{(i,j)}}$$
s9, estimating the position of each sound source at the current moment according to the weight of the particles and the state of the particles;
$$\hat{\mathbf{r}}_t^{(i)} = \sum_{j=1}^{N} \bar{w}_t^{(i,j)}\, \mathbf{r}_t^{(i,j)}$$
S10, resampling from the existing particle group $\{\mathbf{s}_t^{(i,j)}, \bar{w}_t^{(i,j)}\}_{j=1}^{N}$ according to the weights to obtain the resampled particle group $\{\tilde{\mathbf{s}}_t^{(i,j)}, 1/N\}_{j=1}^{N}$;
S11, the resampled particles and their weights are stored, and the process returns to step S3.
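Steps S8 to S10 for one source can be sketched as follows. The patent does not name a resampling scheme, so systematic resampling is used here purely as an assumed, common choice.

```python
import numpy as np

def estimate_and_resample(particles, weights, rng=None):
    """S8-S10 for one source: normalize the weights, form the weighted
    position estimate, then resample to uniform weights 1/N.  Systematic
    resampling is an assumption; the patent names no scheme."""
    rng = np.random.default_rng() if rng is None else rng
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                   # S8: normalization
    r_hat = w @ particles[:, :2]                      # S9: position estimate
    N = len(w)
    positions = (rng.random() + np.arange(N)) / N
    idx = np.searchsorted(np.cumsum(w), positions)    # S10: systematic resampling
    return r_hat, particles[idx], np.full(N, 1.0 / N)

parts = np.arange(20, dtype=float).reshape(5, 4)
w = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
r_hat, new_parts, new_w = estimate_and_resample(parts, w, rng=np.random.default_rng(0))
```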
Further, attenuating the state observations of the weaker sound source's particles in step S5 comprises the following steps:
S5.1, computing the localization-function value at each sound source's previous position estimate, from the previous received signal frame:

$$P_{t-1}\big(\hat{\mathbf{r}}_{t-1}^{(i)}\big), \quad i = 1{:}N_s,$$

where $\hat{\mathbf{r}}_{t-1}^{(i)}$ is the position estimate of the $i$-th sound source at time $t-1$;
S5.2, designating the source with the larger localization-function value as the primary source, i.e. the primary-source index at time $t$ is

$$h_t = \arg\max_i P_{t-1}\big(\hat{\mathbf{r}}_{t-1}^{(i)}\big).$$

Once $h_t$ is known, the primary source is easily associated with its particles: the particle group of the $i$-th source, $\{\mathbf{s}_t^{(i,j)}, w_t^{(i,j)}\}_{j=1}^{N}$, belongs to the primary source if $i = h_t$, and otherwise does not;
S5.3, computing the distance between each particle of the weaker sound source and the primary source's previous position estimate:

$$d_t^{(i,j)} = \big\|\mathbf{r}_t^{(i,j)} - \hat{\mathbf{r}}_{t-1}^{(h_t)}\big\|, \quad i \neq h_t,$$

where $\mathbf{r}_t^{(i,j)}$ is the position of the $j$-th particle of the $i$-th sound source at time $t$ and $\hat{\mathbf{r}}_{t-1}^{(h_t)}$ is the primary-source position estimate at time $t-1$;
S5.4, constructing an attenuation coefficient from the distance of step S5.3, and multiplying the weaker source's state observations by it to obtain the attenuated observations:

$$\tilde{z}_t^{(i,j)} = \Big(1 - \mu\, e^{-d_t^{(i,j)}/z}\Big)\, z_t^{(i,j)}, \quad i \neq h_t,$$

where $\mu$ is a constant less than 1 whose larger values give a larger attenuation amplitude, and $z$ is a constant determining the attenuation rate, with larger values giving larger attenuation at the same distance.
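The suppression step S5.3–S5.4 can be sketched as follows. The exact functional form in the patent survives only as an image; the exponential form α = 1 − μ·exp(−d/z) used here is an assumption consistent with the stated behavior (α near 1 for d ≥ 0.5 m with μ = 0.8, z = 0.15 m, and α = 1 − μ at d = 0).

```python
import numpy as np

def attenuate_observations(z_obs, particle_xy, primary_xy, mu=0.8, z=0.15):
    """S5: scale the weaker source's particle observations by an attenuation
    coefficient that shrinks them near the primary source.  The form
    alpha = 1 - mu*exp(-d/z) is an assumed reconstruction."""
    d = np.linalg.norm(np.asarray(particle_xy) - np.asarray(primary_xy), axis=-1)
    alpha = 1.0 - mu * np.exp(-d / z)         # attenuation coefficient
    return alpha * np.asarray(z_obs), alpha

z_att, alpha = attenuate_observations(
    np.ones(3),
    np.array([[0.0, 0.0], [0.15, 0.0], [0.5, 0.0]]),  # particles at d = 0, 0.15, 0.5 m
    np.zeros(2))                                       # primary source at origin
```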
The invention has the beneficial effects that:
(1) Compared with the prior art, the method suppresses the primary sound source during tracking and thus avoids the estimated trajectory of the weaker source merging with that of the primary source when the two sources come close. Specifically, in step S5 the state observations of the weaker source's particles near the primary source are multiplied by a suitable attenuation coefficient, which reduces those particles' weights. When the two sources approach each other or their trajectories cross, the lowered weights prevent the weaker source's particles from being attracted by the primary source, preserve the independence of each source's particle group, and keep both sources tracked.
(2) The method places no restriction on the shape of the microphone array, so it applies to any array geometry; nor does it restrict the motion trajectory of the sound source, so it handles curvilinear source motion.
Drawings
FIG. 1 is a main flow chart of the method of the present invention.
Fig. 2 is a flowchart of method step S5 of the present invention.
Fig. 3 is a schematic diagram of the microphone positions and sound source trajectories according to the present invention.
FIG. 4 is a schematic diagram illustrating the comparison between the tracking trajectory of the non-intersecting object and the real trajectory by the method of the present invention.
FIG. 5 is a schematic diagram illustrating the comparison between the tracking trajectory and the real trajectory of the intersecting object by the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Example:
The simulated room measures 4 m × 4 m × 2.7 m. As shown in Figs. 1, 2 and 3, 8 microphones are mounted on the surrounding walls at a height of 1.464 m. The semicircles in the figure are the sound source trajectories; the sources move in the directions indicated by the arrows, at the same height as the microphones. The sound source signals are two segments of male speech of about 3.6 s each, taken from the TIMIT database, with sampling frequency $f_s = 8$ kHz. The microphone signals are generated by the image method, with added white Gaussian noise at an SNR of 20 dB and a reverberation time $T_{60} = 0.132$ s. The frame length is $L = 512$ points with no overlap, and a Hanning window is applied. The specific steps are as follows:
s1, establishing a two-dimensional rectangular coordinate system, determining the coordinates of each array element in the microphone array, framing and windowing sound source signals received by the microphone array, and storing the sound source signals into a buffer area;
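The framing and windowing of S1 can be sketched as follows, using the embodiment's parameters (L = 512 points, no overlap, Hanning window); the function name is illustrative.

```python
import numpy as np

def frame_and_window(x, frame_len=512):
    """S1: split a microphone signal into non-overlapping frames and apply a
    Hanning window, as in the embodiment (L = 512, no overlap)."""
    n_frames = len(x) // frame_len
    frames = np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
    return frames * np.hanning(frame_len)

frames = frame_and_window(np.arange(1200, dtype=float))
```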
s2, generating an initial particle group of each sound source according to the initial state of each sound source;
In the rectangular coordinate system, the state vector of the $i$-th sound source at frame $t$ is $\mathbf{s}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}, \dot{x}_t^{(i)}, \dot{y}_t^{(i)}]^T$, whose first two elements are the source coordinates and whose last two are its velocity; the corresponding source coordinate vector is $\mathbf{r}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}]^T$. With $N$ the number of particles ($N = 50$ in this example) and $N_s = 2$ the number of sound sources, the initial particle group is

$$\{\mathbf{s}_0^{(i,j)}, w_0^{(i,j)}\}_{j=1}^{N}, \quad i = 1{:}N_s,$$

where $\mathbf{s}_0^{(i,j)} = \mathbf{s}_0^{(i)}$ is the initial state of the sound source (the initial position is assumed known and the initial velocity is 0) and $w_0^{(i,j)} = 1/N$ is the initial particle weight.
S3, predicting the new particle state of each sound source according to the state equation;
The state equation can be written $\mathbf{s}_t^{(i,j)} = f(\mathbf{s}_{t-1}^{(i,j)}, \mathbf{u}_t)$. Specifically, the Langevin equation is used:

$$\dot{\mathbf{r}}_t^{(i,j)} = a\,\dot{\mathbf{r}}_{t-1}^{(i,j)} + b\,\mathbf{u}_t, \qquad \mathbf{r}_t^{(i,j)} = \mathbf{r}_{t-1}^{(i,j)} + T\,\dot{\mathbf{r}}_t^{(i,j)},$$

where $\mathbf{u}_t$ is a two-dimensional Gaussian random vector with mean $[0, 0]^T$ and second-order identity covariance matrix, and $T$ is the duration of one signal frame (the state-update interval); in this example $T = L/f_s = 64$ ms. The parameters are $a = e^{-\beta T}$ and $b = \bar{v}\sqrt{1 - a^2}$; matching a normal human walking speed, this example takes $\beta = 10$ Hz and sets $\bar{v}$ accordingly.
s4, calculating the observed value of each sound source particle state at the current moment by using a positioning function according to the received signal frame;
Denote the received signal frame of the microphone array $\mathbf{x}_t = [x_1(n), x_2(n), \ldots, x_M(n)]^T$, where $M$ is the number of array elements ($M = 8$ in this example). The steered response power with phase transform (SRP-PHAT) function is used as the localization function:

$$P_t(\mathbf{r}) = \sum_{l=1}^{M-1} \sum_{m=l+1}^{M} R_{lm}\big(\tau_{lm}(\mathbf{r})\big),$$

where $\mathbf{r}$ is a hypothetical source coordinate, $l$ and $m$ are array-element indices, and $R_{lm}(\tau)$ is the PHAT-weighted generalized cross-correlation function of an element pair, defined as

$$R_{lm}(\tau) = \frac{1}{K} \sum_{k=0}^{K-1} \frac{X_l(k) X_m^*(k)}{|X_l(k) X_m^*(k)|}\, e^{j\omega\tau},$$

where $X_m(k)$ is the $K$-point discrete Fourier transform of the $m$-th element's received signal $x_m(n)$ ($K = 512$ in this example), $(\cdot)^*$ denotes complex conjugation, and $\omega = 2\pi k f_s / K$ is the analog angular frequency. $\tau_{lm}(\mathbf{r})$ is the time difference of arrival (TDOA) of the hypothetical source between an element pair,

$$\tau_{lm}(\mathbf{r}) = \frac{\|\mathbf{r} - \mathbf{r}_l\| - \|\mathbf{r} - \mathbf{r}_m\|}{c},$$

where $\mathbf{r}_m$ is the coordinate of the $m$-th element, $c$ is the speed of sound (342 m/s), and $\|\cdot\|$ denotes the vector 2-norm. The particle-state observation computed from the localization function is

$$z_t^{(i,j)} = P_t\big(\mathbf{r}_t^{(i,j)}\big).$$
S5, constructing an attenuation coefficient from the distance between each particle of the weaker sound source and the primary source, and multiplying it by the particle-state observation to obtain the attenuated observation $\tilde{z}_t^{(i,j)}$. The specific procedure is:
S5.1, computing the localization-function value at each sound source's previous position estimate, from the previous received signal frame:

$$P_{t-1}\big(\hat{\mathbf{r}}_{t-1}^{(i)}\big), \quad i = 1{:}N_s,$$

where $\hat{\mathbf{r}}_{t-1}^{(i)}$ is the position estimate of the $i$-th sound source at time $t-1$;
S5.2, designating the source with the larger localization-function value as the primary source, i.e. the primary-source index at time $t$ is

$$h_t = \arg\max_i P_{t-1}\big(\hat{\mathbf{r}}_{t-1}^{(i)}\big).$$

Once $h_t$ is known, the primary source is easily associated with its particles: the particle group of the $i$-th source, $\{\mathbf{s}_t^{(i,j)}, w_t^{(i,j)}\}_{j=1}^{N}$, belongs to the primary source if $i = h_t$, and otherwise does not;
S5.3, computing the distance between each particle of the weaker sound source and the primary source's previous position estimate:

$$d_t^{(i,j)} = \big\|\mathbf{r}_t^{(i,j)} - \hat{\mathbf{r}}_{t-1}^{(h_t)}\big\|, \quad i \neq h_t,$$

where $\mathbf{r}_t^{(i,j)}$ is the position of the $j$-th particle of the $i$-th sound source at time $t$ and $\hat{\mathbf{r}}_{t-1}^{(h_t)}$ is the primary-source position estimate at time $t-1$;
S5.4, constructing an attenuation coefficient from the distance of step S5.3 and multiplying the weaker source's state observations by it:

$$\tilde{z}_t^{(i,j)} = \Big(1 - \mu\, e^{-d_t^{(i,j)}/z}\Big)\, z_t^{(i,j)}, \quad i \neq h_t,$$

where $\mu$ is a constant less than 1 whose larger values give a larger attenuation amplitude, and $z$ is a parameter determining the attenuation rate, with larger values giving larger attenuation at the same distance. When the sources are close, many particles of the weaker source have small $d_t^{(i,j)}$; their observations should not be attenuated to near zero but kept at an appropriate proportion so that they can still contribute to the state estimate, so $\mu$ must not be too close to 1. Nor can $\mu$ be too small, or the influence of the primary source is not attenuated enough. Considering that the distance between speakers is usually greater than 0.5 m, the attenuation coefficient should be close to 1 when $d_t^{(i,j)} \geq 0.5$ m. This example takes $\mu = 0.8$ and $z = 0.15$ m;
s6, constructing a pseudo-likelihood function of each sound source according to the positioning function and the attenuation coefficient;
The pseudo-likelihood function is denoted $\tilde{p}(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)})$ and given by

$$\tilde{p}\big(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)}\big) = \max\big(\tilde{z}_t^{(i,j)},\ 0\big)^r,$$

where $\max(\cdot)$ ensures that the likelihood function is non-negative, and $r$ is a positive real number that adjusts the shape of the likelihood function to improve the performance of the tracking algorithm; $r = 3$ in this embodiment.
S7, calculating the weight of each sound source particle at the current moment according to the pseudo-likelihood function;
$$w_t^{(i,j)} = w_{t-1}^{(i,j)}\,\tilde{p}\big(\mathbf{x}_t \mid \mathbf{s}_t^{(i,j)}\big)$$
s8, normalizing the particle weight of each sound source;
$$\bar{w}_t^{(i,j)} = \frac{w_t^{(i,j)}}{\sum_{j=1}^{N} w_t^{(i,j)}}$$
s9, estimating the position of each sound source at the current moment according to the weight of the particles and the state of the particles;
$$\hat{\mathbf{r}}_t^{(i)} = \sum_{j=1}^{N} \bar{w}_t^{(i,j)}\, \mathbf{r}_t^{(i,j)}$$
S10, resampling from the existing particle group $\{\mathbf{s}_t^{(i,j)}, \bar{w}_t^{(i,j)}\}_{j=1}^{N}$ according to the weights to obtain the resampled particle group $\{\tilde{\mathbf{s}}_t^{(i,j)}, 1/N\}_{j=1}^{N}$;
S11, the resampled particles and their weights are stored, and the process returns to step S3.
To assess the tracking performance of the method, two evaluation metrics are defined: root mean square error (RMSE) and tracking loss rate. Both are computed separately for each sound source. The root mean square error is defined as

$$\mathrm{RMSE}_i = \sqrt{\frac{1}{K_s} \sum_{t=1}^{K_s} \big\|\mathbf{r}_t^{(i)} - \hat{\mathbf{r}}_t^{(i)}\big\|^2},$$

where $i$ is the source index, $\mathbf{r}_t^{(i)}$ is the true position of the $i$-th sound source at frame $t$, $\hat{\mathbf{r}}_t^{(i)}$ is its estimate, and $K_s$ is the number of signal frames.

If at any time during a tracking run the error $\|\mathbf{r}_t^{(i)} - \hat{\mathbf{r}}_t^{(i)}\|$ exceeds a set threshold, target $i$ is considered lost in that run. With $N_{\mathrm{track}}$ the number of tracking runs and $N_{\mathrm{loss}}$ the number of target losses, the tracking loss percentage (TLP) is defined as

$$\mathrm{TLP} = \frac{N_{\mathrm{loss}}}{N_{\mathrm{track}}} \times 100\%.$$
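The two metrics can be sketched as follows. The loss threshold value is not reproduced in this text, so the default below is an assumption for illustration only.

```python
import numpy as np

def rmse_and_loss(true_xy, est_xy, loss_threshold=0.5):
    """RMSE over a run and a loss flag for one source.  true_xy, est_xy:
    (Ks, 2) trajectories.  loss_threshold (metres) is an assumed value."""
    err = np.linalg.norm(np.asarray(true_xy) - np.asarray(est_xy), axis=1)
    rmse = float(np.sqrt(np.mean(err ** 2)))
    lost = bool(np.any(err > loss_threshold))
    return rmse, lost

def tracking_loss_percentage(loss_flags):
    """TLP = N_loss / N_track * 100%."""
    flags = np.asarray(loss_flags, dtype=bool)
    return 100.0 * flags.sum() / flags.size

rmse, lost = rmse_and_loss([[0.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 0.3]])
tlp = tracking_loss_percentage([True, False, False, False])
```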
Case of non-intersecting sound source trajectories:
As shown in Fig. 2, the trajectories of both sound sources are semicircles of radius 0.75 m. Source $S_1$ starts at [1.2, 3] with circle center [1.95, 3]; source $S_2$ starts at [1.2, 1.8] with circle center [1.95, 1.8]. The two sources move in uniform circular motion over 3.6 s, keeping a distance of 1.2 m. Tracking is performed with the conventional algorithm and with the algorithm of the invention; Fig. 4 shows one representative tracking result, in which the dotted lines are the true trajectories and the solid lines the estimated ones.
As can be seen from Fig. 3, when the source trajectories do not intersect and remain far apart, the method tracks both sources well. To examine its performance further, tracking experiments were run at different source distances with the proposed method and the conventional method, with 30 runs per condition; the results are shown in Table 1, where $D_s$ is the distance between the sources and $\overline{\mathrm{RMSE}}$ denotes the mean root mean square error, averaged over the successfully tracked runs, which reflects the tracking accuracy of the algorithm.
[Table 1: mean RMSE and tracking loss rate at different source distances $D_s$ for the two methods; the table image is not reproduced here.]
As Table 1 shows, when the sources are far apart the two methods have similar accuracy, and the proposed method's loss rate is slightly lower than the conventional method's; when the sources are close, the conventional method's loss rate is high, because at close range the weaker source's particles are easily "attracted" by the primary source, losing the target. By adjusting the particle observations in time, the proposed method effectively reduces the loss rate, and its tracking accuracy is higher than the conventional method's.
Case of crossing sound source trajectories:
The trajectory of source $S_1$ is still a semicircle of radius 0.75 m, starting at [1.2, 2.4] with circle center [1.95, 2.4]; the trajectory of source $S_2$ is a straight line from starting point [1.8, 3.1] to end point [1.9, 0.9]. Fig. 5 shows a typical tracking result.
As can be seen from Fig. 5, the method can still track both targets continuously when the source trajectories cross. This shows that attenuating the observations of the weaker source's particles near the primary source effectively prevents them from being "attracted" by it. The 30 runs were also performed for the crossing case; the results are shown in Table 2.
[Table 2: mean RMSE and tracking loss rate for the crossing-trajectory case; the table image is not reproduced here.]
As Table 2 shows, the conventional method's loss rate for source $S_2$ rises sharply. This indicates that at the crossing instant $S_1$ is the primary source and $S_2$ the weaker one, and that after the target trajectories cross the conventional method can hardly keep tracking the weaker source. Table 2 also shows that the proposed method markedly reduces the tracking loss rate of the weaker source.

Claims (1)

1. A multi-source tracking method for suppressing a primary sound source, comprising the steps of:
s1, establishing a two-dimensional rectangular coordinate system, determining the coordinates of each array element in the microphone array, framing and windowing sound source signals received by the microphone array, and storing the sound source signals into a buffer area;
s2, generating an initial particle group of each sound source according to the initial state of each sound source:
in the rectangular coordinate system, the state vector of the $i$-th sound source at frame $t$ is $\mathbf{s}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}, \dot{x}_t^{(i)}, \dot{y}_t^{(i)}]^T$, whose first two elements are the sound source coordinates and whose last two are its velocity; the corresponding source coordinate vector is denoted $\mathbf{r}_t^{(i)} = [x_t^{(i)}, y_t^{(i)}]^T$; with $N$ the number of particles and $N_s$ the number of sound sources, the initial particle group is

$$\{\mathbf{s}_0^{(i,j)}, w_0^{(i,j)}\}_{j=1}^{N}, \quad i = 1{:}N_s,$$

where $\mathbf{s}_0^{(i,j)} = \mathbf{s}_0^{(i)}$ is the initial state of the sound source and $w_0^{(i,j)} = 1/N$ is the initial particle weight;
S3, predicting the new particle state of each sound source according to the state equation:
the state equation can be written $\mathbf{s}_t^{(i,j)} = f(\mathbf{s}_{t-1}^{(i,j)}, \mathbf{u}_t)$; specifically, the Langevin equation can be used as the state equation of the sound source:

$$\dot{\mathbf{r}}_t^{(i,j)} = a\,\dot{\mathbf{r}}_{t-1}^{(i,j)} + b\,\mathbf{u}_t, \qquad \mathbf{r}_t^{(i,j)} = \mathbf{r}_{t-1}^{(i,j)} + T\,\dot{\mathbf{r}}_t^{(i,j)},$$

where $\mathbf{u}_t$ is a two-dimensional Gaussian random vector with mean $[0, 0]^T$ and second-order identity covariance matrix, $T$ is the duration of one signal frame (the state-update interval), $a = e^{-\beta T}$, and $b = \bar{v}\sqrt{1 - a^2}$, with $\beta$ and $\bar{v}$ constants set according to the motion state of the sound source;
S4, calculating the observed value of each sound source's particle states at the current time from the received signal frame by means of the localization function:
the received signal frame of the microphone array is denoted x_t = [x_1(n), x_2(n), ..., x_M(n)]^T, where M is the number of array elements of the microphone array; the steered response power with phase transform (SRP-PHAT) function is used as the localization function, with the expression
P_t(r) = Σ_{l=1}^{M-1} Σ_{m=l+1}^{M} R_{lm}(τ_{lm}(r)),
where r is a hypothetical sound source coordinate, l and m are array-element indices, and R_{lm}(τ) is the generalized cross-correlation function of array-element pair (l, m), defined as
R_{lm}(τ) = Σ_{k=0}^{K-1} (X_l(k)·X_m*(k)) / |X_l(k)·X_m*(k)| · e^{jωτ},
where X_m(k) is the K-point discrete Fourier transform of the m-th array element's received signal x_m(n), * denotes complex conjugation, and ω is the analog angular frequency of the k-th frequency bin;
τ_{lm}(r) = (||r - r_l|| - ||r - r_m||) / c
is the time difference of arrival (TDOA) of the hypothetical sound source to array-element pair (l, m), where r_m denotes the coordinate of the m-th array element, c is the speed of sound (342 m/s), and ||·|| denotes the 2-norm of a vector; the observed value of particle state s_{t,n}^(i) calculated with the localization function is P_t(r_{t,n}^(i));
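A minimal Python sketch of the SRP-PHAT localization function of step S4, assuming one windowed frame per microphone and a planar (2-D) array geometry; the function name and the small regularization constant are illustrative:

```python
import numpy as np

def srp_phat(frames, mic_pos, points, fs, c=342.0):
    """SRP-PHAT localization function of step S4 at candidate points.

    frames: (M, n) array, one windowed frame per microphone;
    mic_pos: (M, 2) microphone coordinates; points: (P, 2) candidate
    source coordinates. Sums the PHAT-weighted cross-correlation of
    every microphone pair, evaluated at that pair's TDOA for each point.
    """
    M, n = frames.shape
    X = np.fft.rfft(frames, n)                       # per-microphone spectra
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)           # Hz per bin
    power = np.zeros(len(points))
    for l in range(M - 1):
        for m in range(l + 1, M):
            cross = X[l] * np.conj(X[m])
            cross /= np.abs(cross) + 1e-12           # PHAT weighting
            # TDOA of each candidate point to the (l, m) pair
            tau = (np.linalg.norm(points - mic_pos[l], axis=1)
                   - np.linalg.norm(points - mic_pos[m], axis=1)) / c
            # evaluate the GCC at tau by direct inverse transform
            power += np.real(np.exp(2j * np.pi * np.outer(tau, freqs)) @ cross)
    return power
```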
S5, constructing an attenuation coefficient from the distance between the particles of each weak sound source and the main source, and multiplying the particle state observed values by this attenuation coefficient to obtain the attenuated particle state observed values P̃_t(r_{t,n}^(i));
S6, constructing the pseudo-likelihood function of each sound source from the localization function and the attenuation coefficient:
denoting the pseudo-likelihood function by p(x_t | s_{t,n}^(i)), its expression is
p(x_t | s_{t,n}^(i)) = max(P̃_t(r_{t,n}^(i)), 0)^r,
where the function max(·) ensures that the likelihood function is non-negative, and r is a positive real number that adjusts the shape of the likelihood function to improve the performance of the tracking algorithm;
S7, calculating the weight of each sound source's particles at the current time from the pseudo-likelihood function:
w̃_{t,n}^(i) = w_{t-1,n}^(i) · p(x_t | s_{t,n}^(i));
S8, normalizing the particle weights of each sound source:
w_{t,n}^(i) = w̃_{t,n}^(i) / Σ_{n=1}^{N} w̃_{t,n}^(i);
S9, estimating the position of each sound source at the current time from the particle weights and particle states:
ŝ_t^(i) = Σ_{n=1}^{N} w_{t,n}^(i) · s_{t,n}^(i),
with the position estimate r̂_t^(i) given by the first two elements of ŝ_t^(i);
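Steps S6 through S9 can be combined into one small update routine for a single source's particle set; the exponent r and the guard against an all-zero weight vector are illustrative choices, not specified by the patent:

```python
import numpy as np

def update_and_estimate(particles, weights, power, r_exp=2.0):
    """Steps S6-S9: pseudo-likelihood, weight update, normalization, estimate.

    power: localization-function value (after any attenuation) at each
    particle's position; the pseudo-likelihood is max(power, 0)**r.
    """
    likelihood = np.maximum(power, 0.0) ** r_exp     # S6: non-negative, shaped by r
    new_w = weights * likelihood                     # S7: weight update
    total = new_w.sum()
    if total > 0:                                    # S8: normalize
        new_w = new_w / total
    else:                                            # degenerate case: fall back to uniform
        new_w = np.full(len(weights), 1.0 / len(weights))
    estimate = new_w @ particles                     # S9: weighted-mean state
    return new_w, estimate
```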
S10, resampling according to the weights from the existing particle set {s_{t,n}^(i), w_{t,n}^(i)}_{n=1}^{N} to obtain the resampled particle set {s̃_{t,n}^(i), 1/N}_{n=1}^{N};
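Step S10 does not name a particular resampling scheme; systematic resampling is one common choice and is sketched below, returning N particles with uniform weights 1/N:

```python
import numpy as np

def systematic_resample(particles, weights, rng=None):
    """Step S10: resample the particle set according to the weights.

    Draws N stratified positions on [0, 1) and maps each onto the
    cumulative weight distribution; the survivors get weight 1/N.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n    # stratified positions in [0, 1)
    idx = np.searchsorted(np.cumsum(weights), positions)
    idx = np.clip(idx, 0, n - 1)                     # guard against rounding at 1.0
    return particles[idx], np.full(n, 1.0 / n)
```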
S11, storing the resampled particles and their weights, and returning to step S3;
attenuating the state observed values of the weak-sound-source particles in step S5 comprises the following steps:
S5.1, calculating the value of the localization function at each sound source's position estimate of the previous time from the signal frame received at the previous time, with the expression
P_{t-1}(r̂_{t-1}^(i)), i = 1, ..., N_s,
where r̂_{t-1}^(i) represents the position estimate of the i-th sound source at time t-1;
S5.2, determining the source with the largest localization-function value as the main source, i.e. the main-source index at time t is
h_t = argmax_i P_{t-1}(r̂_{t-1}^(i));
once the main-source index is obtained, the main source is easily associated with its particles: for the particle set {s_{t,n}^(i), w_{t,n}^(i)}_{n=1}^{N} of the i-th source, if i = h_t the current source is the main source, otherwise it is a weak source;
S5.3, calculating the distance between each weak sound source's particles and the main source's position estimate of the previous time:
d_{t,n}^(i) = ||r_{t,n}^(i) - r̂_{t-1}^(h_t)||, i ≠ h_t,
where r_{t,n}^(i) denotes the position of the n-th particle of the i-th sound source at time t, and r̂_{t-1}^(h_t) denotes the main-source position estimate at time t-1;
S5.4, constructing the attenuation coefficient from the distance of step S5.3, and multiplying the state observed values of the weak-sound-source particles by it to obtain the attenuated particle state observed values:
P̃_t(r_{t,n}^(i)) = [1 - μ·exp(-d_{t,n}^(i) / z)] · P_t(r_{t,n}^(i)), i ≠ h_t,
where μ is a constant less than 1 whose larger values give a larger attenuation amplitude, and z is a constant determining the attenuation rate whose larger values give a larger attenuation amplitude at the same distance.
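The attenuation of steps S5.3 and S5.4 can be sketched as follows; the factor 1 - μ·exp(-d/z) is a reconstruction consistent with the stated behaviour of μ and z (both monotonicities hold), and the values of μ and z are illustrative:

```python
import numpy as np

def attenuate(power, particle_pos, main_src_est, mu=0.8, z=0.5):
    """Steps S5.3-S5.4: attenuate a weak source's observations near the main source.

    particle_pos: (N, 2) weak-source particle positions;
    main_src_est: (2,) main-source position estimate at the previous time.
    Particles close to the main source are attenuated most (factor 1 - mu
    at zero distance); far particles are left nearly unchanged.
    """
    d = np.linalg.norm(particle_pos - main_src_est, axis=1)  # S5.3: distances
    return (1.0 - mu * np.exp(-d / z)) * power               # S5.4: attenuated values
```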
CN202010184264.2A 2020-03-08 2020-03-08 Multi-sound-source tracking method for restraining main sound source Active CN111239691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010184264.2A CN111239691B (en) 2020-03-08 2020-03-08 Multi-sound-source tracking method for restraining main sound source


Publications (2)

Publication Number Publication Date
CN111239691A CN111239691A (en) 2020-06-05
CN111239691B true CN111239691B (en) 2022-03-08

Family

ID=70870674


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115662383B (en) * 2022-12-22 2023-04-14 杭州爱华智能科技有限公司 Method and system for deleting main sound source, method, system and device for identifying multiple sound sources

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4686532A (en) * 1985-05-31 1987-08-11 Texas Instruments Incorporated Accurate location sonar and radar
CN104991573A (en) * 2015-06-25 2015-10-21 北京品创汇通科技有限公司 Locating and tracking method and apparatus based on sound source array

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5403896B2 (en) * 2007-10-31 2014-01-29 株式会社東芝 Sound field control system
CN103152820B (en) * 2013-02-06 2015-08-12 长安大学 A kind of wireless sensor network acoustic target iteration localization method
JP6030012B2 (en) * 2013-03-21 2016-11-24 株式会社東芝 Direction measuring apparatus, direction measuring program, and direction measuring method
EP3134709A4 (en) * 2014-04-22 2018-01-03 BASF (China) Company Ltd. Detector for optically detecting at least one object
CN104820993B (en) * 2015-03-27 2017-12-01 浙江大学 It is a kind of to combine particle filter and track the underwater weak signal target tracking for putting preceding detection
CN108828524B (en) * 2018-06-03 2021-04-06 桂林电子科技大学 Particle filter sound source tracking and positioning method based on Delaunay triangulation
CN109407515A (en) * 2018-12-17 2019-03-01 厦门理工学院 A kind of interference observer design method suitable for non-minimum phase system




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant