CN116543784A - Multi-sound source automatic gain control method based on sound field perception - Google Patents

Multi-sound source automatic gain control method based on sound field perception

Info

Publication number
CN116543784A
CN116543784A
Authority
CN
China
Prior art keywords
sound source
sound
frequency
automatic gain
gain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202310648272.1A
Other languages
Chinese (zh)
Inventor
卢佳欣
陈枢茜
朱阳燕
王君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong Institute of Technology
Original Assignee
Nantong Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong Institute of Technology filed Critical Nantong Institute of Technology
Priority to CN202310648272.1A priority Critical patent/CN116543784A/en
Publication of CN116543784A publication Critical patent/CN116543784A/en
Withdrawn legal-status Critical Current


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a multi-sound-source automatic gain control method based on sound field perception. The method comprises: spatial initialization and multi-sound-source automatic gain initialization; converting the multiple microphone signals into the frequency domain by short-time Fourier transform; obtaining the signal-to-noise ratio of the sound source angle of each region; selecting the sound source that dominates each frequency bin; iteratively solving the sound source spatial propagation parameters of the different regions at the current moment; calculating the inter-frame similarity of the sound sources' spatial distribution over time; updating the energy tracking of the sound sources and obtaining the spatial automatic gain; calculating the automatic compensation gain of each frequency band from the reverberation levels of the mel filter banks; and computing the spatial gain, applying it to the spectrum, and obtaining the processed audio with the inverse short-time Fourier transform. The method improves the effectiveness of volume equalization in multi-sound-source switching scenes, improves volume equalization in reverberant scenes, and addresses the unequal energy losses that sound propagation causes in different frequency bands.

Description

Multi-sound source automatic gain control method based on sound field perception
Technical Field
The invention belongs to the field of sound field control, and particularly relates to a sound field perception-based multi-sound source automatic gain control method.
Background
With the rapid development of Voice over Internet Protocol (VoIP) applications in recent years, audio/video communication schemes represented by WebRTC have become increasingly popular. As the core of audio processing in network calls, automatic gain control (AGC) adaptively tracks the energy envelope of the audio in the time or frequency domain and applies gain to the received data, effectively realizing volume control of the sound.
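As a minimal sketch of the envelope-tracking AGC behavior described above (function name and constants are illustrative, not from the patent):

```python
import numpy as np

def agc_gains(frames, target_db=-20.0, alpha=0.95, eps=1e-12):
    """Recursively track the audio energy envelope and derive a
    per-frame gain that pushes the level toward target_db.
    Generic AGC sketch; constants are illustrative."""
    env = eps
    gains = []
    for frame in frames:
        energy = float(np.mean(np.asarray(frame) ** 2))
        env = alpha * env + (1.0 - alpha) * energy  # smoothed energy envelope
        level_db = 10.0 * np.log10(env + eps)
        gains.append(10.0 ** ((target_db - level_db) / 20.0))
    return np.array(gains)
```

A quieter frame receives a larger gain than a louder one, which is the core equalizing behavior the patent builds upon.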
However, in audio communication scenes such as conference rooms, multi-person audio is collected through a microphone array. The loudness of different speakers is not uniform, and differences in position between each sound source and the microphone array also cause significant differences in the energy of the audio data received by the microphones.
For the multi-sound-source switching scenario, the flow of the conventional AGC algorithm is shown in fig. 1: first, sound source localization is performed; then, based on the localization result, beamforming is used to estimate the energy of each sound source; finally, automatic gain tracking is performed on each sound source's energy.
For the multi-sound-source switching scenario, the existing AGC, DAGC and ASGC schemes still face the following five problems in practice:
1. When the number of microphones is small (fewer than 3), sound source localization accuracy cannot be guaranteed, and the sound source energy tracked via beamforming is not stable enough.
2. When the speaker switches, the spatial sound source parameters cannot be estimated accurately.
3. The speed of spatial sound source energy tracking cannot be adjusted adaptively, so the energy of each source cannot be tracked quickly and stably, and the accuracy and timeliness of each source's gain calculation cannot be ensured.
4. The sound source localization search is often computationally expensive.
5. These schemes do not account for the sound distortion caused by reverberation.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art by providing a multi-sound-source automatic gain control method based on sound field perception, which improves the effectiveness of volume equalization in multi-sound-source switching scenes, improves volume equalization in reverberant scenes, and addresses the unequal energy losses that sound propagation causes in different frequency bands.
The technical scheme is as follows: in order to achieve the above purpose, the technical scheme of the invention is as follows:
a multi-sound source automatic gain control method based on sound field perception comprises the following steps:
s1: space initialization: setting the maximum number of sound sources, dividing the space, and dividing the plane by 0-180 degrees to form a plurality of areas;
initializing the automatic gain of multiple sound sources;
s2: and (3) time-frequency conversion: converting the plurality of microphone signals into a frequency domain through short-time Fourier transform;
s3: sound field perception: obtaining the signal to noise ratio of the sound source angle of each region;
selecting a sound source playing a leading role at a frequency point;
iteratively solving sound source space propagation parameters of different areas at the current moment;
s4: calculating the time-space similarity of sound sources: calculating the inter-frame similarity of the time space distribution of the sound source at the current moment;
s5: spatial automatic gain control: updating the energy tracking of the sound source and obtaining the space automatic gain;
s6: adaptive multiband automatic gain compensation: using the reverberation grade of the Mel filter group, calculating to obtain the automatic compensation gain of the frequency band;
s7: gain smoothing, time-frequency inverse transformation: the spatial gain is calculated and applied to the spectrum, and the processed audio is obtained by using short-time Fourier inverse transformation.
Further, based on step S1: the maximum number of sound sources is set, and the 0-180 degree plane is divided into 6 regions of 30 degrees each, with the region center angles initialized as η = {15, 45, 75, 105, 135, 165};
multi-sound-source automatic gain initialization: the automatic gain control energy smoothing factors α_min, α_max and the target energy level are initialized.
Further, based on step S2: the convolution model is widely used for sound propagation in a closed space; its mathematical form is
x_i(t) = Σ_{j=1}^{N} h_{ij}(t) * s_j(t)
where: x_i(t) is the audio signal received by the i-th microphone at time t; s_j(t) is sound source j; N is the number of sound sources; h_{ij}(t) is the transfer function from sound source j to microphone i, and * denotes convolution;
the microphone array signals x_i(t) are converted to the frequency-domain form X_i(ω, t); expanding the model for the two-microphone case and taking the first microphone as the reference gives
X_1(ω, t) = Σ_{j=1}^{N} S_j(ω, t),  X_2(ω, t) = Σ_{j=1}^{N} a_j e^{-jωδ_j} S_j(ω, t)
where a_j and δ_j are the attenuation and delay parameters of sound source j relative to the reference microphone.
further, based on step S3: assuming that the propagation parameter a of each sound source is a constant 1, and the spatial information of the sound source is reflected on the sound source propagation parameter delta, the iterative solution of the sound source spatial propagation parameter delta is as follows:
wherein:gamma denotes a forgetting factor and beta denotes the update speed of the sound source space parameter.
Further, based on step S4: based on equation (23), the region containing the currently significant sound source is determined;
then the cross-correlation of the sound source spatial distribution vectors at the current and previous moments is computed, yielding the time-space similarity factor ξ,
where, if DominateSource(t) ≠ DominateSource(t-1), then h_prob(t) = 0 and h_cnt(t) = 0.
Further, based on step S5: an adaptive energy smoothing algorithm based on the time-space similarity is designed,
where α(t) is the dynamic energy temporal smoothing factor and α_LT is a fixed long-term energy tracking factor,
α(t) = α_min + (α_max - α_min)·ξ(t)   (26)
from which the gain at time t can be obtained.
Further, based on step S6: when sound propagates indoors, the low-frequency components, having longer wavelengths, are more prone to diffuse reflection, so their reverberation in the audio data is larger and their energy loss smaller; the high-frequency components, having shorter wavelengths, are more prone to specular reflection, so their reverberation at the microphone is smaller than that of the low frequencies and their energy loss is large. To mitigate this, mel filter banks are used,
where M_F(ω) is the filter coefficient of the F-th mel filter bank at frequency ω,
and the reverberation levels K(F, τ) of the different mel filter banks are calculated.
Based on the relationship between the degree of reverberation and the distance of the sound source from the microphone (the farther the source from the microphone, the higher the reverberation), the low-frequency part of the sound is taken as the reference band, and automatic gain compensation based on the reverberation degree of the reference band is constructed,
where F_sum is the number of frequency bands.
Further, based on step S7:
the calculated spatial gain is smoothed and the update speed of the automatic gain is controlled; the gain is smoothed over both the time and frequency dimensions within the frequency range,
where K_min(t) is the minimum of K(t) tracked via an improved minima-controlled recursive averaging technique;
in equation (33), the gain is reset when the reverberation degree falls below a threshold, reducing gain-tracking errors caused by scattered noise and strong reverberation;
across frequency, a gain interpolation algorithm is employed;
finally, the output is produced by the inverse time-frequency transform:
X_out = ISTFT(Gain · X_in)   (35).
Beneficial effects: the invention has the following effects:
(1) When the number of microphones is small, the invention combines a cross-correlation-based sound source signal-to-noise ratio estimation technique with a speech separation technique based on W-disjoint orthogonality and degenerate unmixing estimation to address problems 1-3, and proposes a multi-sound-source maximum likelihood sound source localization method constrained by the directional signal-to-noise ratio. The method replaces the coarse search in sound source localization with solving for the direction of maximum signal-to-noise ratio, turning a search problem into a closed-form one, which improves coarse-search precision and reduces computation. Because the small number of microphones limits the value of traditional fine-search algorithms, the fine search is converted into maximum likelihood parameter estimation, and a more accurate sound source position is solved through an adaptive iterative process seeded by the maximum-direction signal-to-noise ratio estimate from the first step.
(2) For problem 4, compared with the traditional scheme of driving beamforming with sound source localization to compute directional gain, the invention controls the speed of sound source energy tracking based on the spatio-temporal similarity of the sources, improving the energy-tracking performance of automatic gain control and relaxing the accuracy requirement on sound source localization.
(3) For problem 5, the invention constructs an adaptive gain compensation curve over auditory frequencies under different reverberation intensities, mitigating the unequal energy losses that sound propagation causes in different frequency bands.
Drawings
FIG. 1 is the algorithm flow of a prior-art automatic gain control technique;
FIG. 2 is the flow of the multi-sound-source automatic gain control algorithm based on sound field perception;
FIG. 3 shows the curves of the function f(x) for different values of the hyperparameter const;
FIG. 4 shows the spectrum comparison before and after the frequency adaptive gain compensation of the invention is applied.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
As shown in fig. 2, a multi-sound source automatic gain control method based on sound field perception comprises the following steps:
S1: spatial initialization: the maximum number of sound sources is set and the 0-180 degree plane is divided into several angular regions;
multi-sound-source automatic gain initialization: the automatic gain control energy smoothing factors α_min, α_max and the target energy level are initialized;
S2: and (3) time-frequency conversion: converting the plurality of microphone signals into a frequency domain through short-time Fourier transform;
S3: sound field perception: the signal-to-noise ratio SNR of the sound source angle η of each region is obtained using equations (17)-(21);
the sound source dominant at frequency bin (ω, t) is selected using equation (22);
the sound source spatial propagation parameters of the different regions at time t are solved iteratively using equation (14);
S4: sound source time-space similarity calculation: the inter-frame similarity ξ of the sound source spatial distribution at the current moment is calculated using equations (23)-(25);
S5: spatial automatic gain control: the sound source energy tracking is updated using equations (26)-(28), and the spatial automatic gain is obtained;
S6: adaptive multiband automatic gain compensation: the reverberation levels K(F, τ) of the mel filter banks are computed using equations (2), (18) and (29), and the automatic compensation gain G_Compensate of each band F is calculated using equation (30);
S7: gain smoothing and inverse time-frequency transform: the spatial gain is calculated using equations (32) and (34) and applied to the spectrum, and the processed audio is obtained via the inverse short-time Fourier transform.
When the number of microphones is small, a cross-correlation-based sound source signal-to-noise ratio estimation technique is combined with a speech separation technique based on W-disjoint orthogonality and the degenerate unmixing estimation technique (DUET), and a multi-sound-source maximum likelihood sound source localization method constrained by the directional signal-to-noise ratio is proposed. The method replaces the coarse search in sound source localization with solving for the direction of maximum signal-to-noise ratio, turning a search problem into a closed-form one, which improves coarse-search precision and reduces computation. Because the small number of microphones limits the value of traditional fine-search algorithms, the fine search is converted into maximum likelihood parameter estimation, and a more accurate sound source position is solved through an adaptive iterative process seeded by the maximum-direction signal-to-noise ratio estimate from the first step.
The speed of sound source energy tracking is controlled based on the space-time similarity of the sound source, the effect of the automatic gain control technology on the sound source energy tracking is improved, and the accuracy requirement on sound source positioning estimation is reduced.
An adaptive gain compensation technique (Automatic Gain Compensate based on Reverb Level Estimation, AGC-RLE) based on auditory filter bank reverberant intensity estimation constructs an adaptive gain compensation curve based on auditory frequencies under different reverberant intensities, and solves the problem of different energy losses in different frequency bands caused by sound propagation.
Step S1: initializing:
Spatial initialization: the maximum number of sound sources is set, and the 0-180 degree plane is divided into 6 regions of 30 degrees each, with the region center angles initialized as η = {15, 45, 75, 105, 135, 165};
multi-sound-source automatic gain initialization: the automatic gain control energy smoothing factors α_min, α_max and the target energy level are initialized.
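The initialization of step S1 can be sketched as follows; the smoothing-factor and target-level values are assumptions for illustration, since the patent leaves these quantities symbolic:

```python
import numpy as np

REGION_WIDTH_DEG = 30
NUM_REGIONS = 180 // REGION_WIDTH_DEG          # 6 regions over the 0-180 degree plane
ETA = np.arange(15, 180, REGION_WIDTH_DEG)     # region center angles eta = {15, ..., 165}

# assumed illustrative values; the patent only names these symbols
ALPHA_MIN, ALPHA_MAX = 0.90, 0.999             # AGC energy smoothing factors
TARGET_LEVEL_DB = -20.0                        # target energy level
```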
Step two: microphone signal time-frequency conversion
The convolution model is widely used for sound propagation in a closed space; its mathematical form is
x_i(t) = Σ_{j=1}^{N} h_{ij}(t) * s_j(t)
where:
x_i(t) is the audio signal received by the i-th microphone at time t (an array of two microphones is taken as the example here);
s_j(t) is sound source j;
N is the number of sound sources;
h_{ij}(t) is the transfer function from sound source j to microphone i, and * denotes convolution.
The microphone array signals x_i(t) are converted to the frequency-domain form X_i(ω, t). Expanding the model for the two-microphone case and taking the first microphone as the reference gives
X_1(ω, t) = Σ_{j=1}^{N} S_j(ω, t),  X_2(ω, t) = Σ_{j=1}^{N} a_j e^{-jωδ_j} S_j(ω, t)
where a_j and δ_j are the attenuation and delay parameters of sound source j relative to the reference microphone.
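A minimal sketch of this time-frequency conversion for two microphones, using `scipy.signal.stft`; under the W-disjoint orthogonality assumption (one dominant source per bin), the ratio X2/X1 exposes the mixing parameters a and δ of the dominant source. Function names are mine, not the patent's:

```python
import numpy as np
from scipy.signal import stft

def mixing_parameters(x1, x2, fs, nperseg=512, eps=1e-12):
    """Convert both microphone signals to the frequency domain and,
    assuming one dominant source per bin, read off the attenuation
    |X2/X1| and the phase angle(X2/X1) = -omega*delta of that source,
    with microphone 1 as reference. Sketch only."""
    f, t, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)
    ratio = X2 / (X1 + eps)
    return f, t, np.abs(ratio), np.angle(ratio)
```

For a source that reaches microphone 2 attenuated by a known factor and without delay, the amplitude map concentrates around that factor.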
step three: sound field perception
Since the spacing within the microphone array is much smaller than the distance between the microphones and the sound sources, and considering the influence of reverberation, we assume the propagation parameter a of each sound source is the constant 1, so the spatial information of each sound source is carried by its propagation parameter δ, and an objective function is constructed.
Based on the W-disjoint orthogonality assumption, only one sound source k dominates at each frequency bin (ω, τ),
where:
thus, the estimation of the spatial propagation parameters delta for each sound source in space can be performed by constructing likelihood functions
Wherein M represents the number of frequency bands,
π -1 (k) Representing the set of all the frequency bins at time t where the sound source k dominates.
Taking the logarithm of likelihood function (6), and since only one sound source dominates at each frequency bin, the log-likelihood is obtained,
which is maximized in equation (8).
To ensure continuity of the objective function, an auxiliary function is constructed,
so that equation (8) can be rewritten accordingly.
Differentiating the objective function J(t) yields
the iterative solution of the sound source spatial propagation parameter δ, where X_2*(ω, t) denotes the complex conjugate of X_2(ω, t),
where γ denotes a forgetting factor and β denotes the update speed of the sound source spatial parameter.
From the above derivation, the maximum-likelihood sound source spatial parameter estimation depends strictly on assumption (4) holding; therefore, to ensure the accuracy of the spatial parameter estimation, only frequency bins satisfying assumption (4) should be selected for the update in (13).
A cross-correlation-based directional sound source signal-to-noise ratio estimation method for reverberant scenes is adopted: for a directional sound source j with incidence direction η, the cross-correlation function between the microphones is assumed,
where f_s is the sampling rate and d is the distance between the microphones.
modeling reverberation as a scattered noise field with a cross-correlation function between microphones that approximates
For the signals received by the microphones, the required statistics are obtained in practice by recursive smoothing,
where α represents a temporal recursion constant.
Based on the scattered noise field model, the reverberation degree estimation based on the direct-to-diffuse ratio can be expressed as:
assuming that the signal-to-noise ratio of the sound source with the incident direction eta is SNR, the method can obtain
The equation (18) is developed by Euler's equation to obtain the sound source with the incident direction eta and the signal-to-noise ratio of SNR
Wherein, the liquid crystal display device comprises a liquid crystal display device,
substituting the formula (19) into the formula (4) to obtain
Step four: spatio-temporal similarity estimation
From the derivation of the previous part, we obtain the spatial propagation parameters δ_j of the multiple sound sources. In practice, to ensure stable and timely tracking of the multiple sources, a vector representing the spatial distribution of the sound sources is constructed:
V(t) = {f(ρ(δ_1, t)), ..., f(ρ(δ_N, t))}   (22)
where ω_L, ω_H denote the lower and upper limits of the frequency range used to represent sound source spatial information; here ω_L is set to 500 Hz, and ω_H is chosen as the highest frequency at which the microphone array is free of spatial aliasing; C denotes the number of frequency bins between ω_L and ω_H. A nonlinear function f(x), inversely proportional to its input x, is constructed to further widen the separation between the different spatial propagation parameters δ_j; to satisfy monotonicity we take p = 3, and FIG. 3 shows the curves of f(x) for different values of the hyperparameter const.
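The patent does not reproduce the exact form of f(x); as a hedged stand-in that satisfies the stated properties (monotonically decreasing, inversely proportional to its input, shaped by the hyperparameters const and p = 3), one could use:

```python
def f(x, const=1.0, p=3):
    """Assumed stand-in for the nonlinear mapping f: monotonically
    decreasing in x, shaped by the hyperparameters const and p.
    Not the patent's exact formula."""
    return 1.0 / (1.0 + (x / const) ** p)
```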
Specifically, FIG. 3 is a graph of y = f(x) for different values of the hyperparameter const when p = 3. The region containing the currently significant sound source is first determined based on equation (23);
then the cross-correlation of the sound source spatial distribution vectors at the current and previous moments is computed, yielding the time-space similarity factor ξ,
where, if DominateSource(t) ≠ DominateSource(t-1), then h_prob(t) = 0 and h_cnt(t) = 0.
Step five: space automatic gain calculation
To avoid abrupt gain changes and spatial-gain errors caused by deviations in the estimated spatial sound source energy during rapid switching, the energy of the raw microphone signal is used in place of the post-spatial-filter energy used for spatial automatic gain calculation in the DAGC scheme. Meanwhile, an adaptive energy smoothing algorithm based on the time-space similarity replaces the fixed long-term smoothing factor, enabling fast tracking of the sound source energy and ensuring the accuracy and continuity of the spatial automatic gain calculation,
where α(t) is the dynamic energy temporal smoothing factor and α_LT is a fixed long-term energy tracking factor,
α(t) = α_min + (α_max - α_min)·ξ(t)   (26)
from which the gain at time t can be obtained.
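Equation (26) and the similarity-driven energy update can be sketched as follows; the target level is an assumption, since the patent keeps it symbolic:

```python
import numpy as np

def adaptive_smooth(e_prev, e_now, xi, alpha_min=0.90, alpha_max=0.999):
    """Eq. (26): alpha(t) = alpha_min + (alpha_max - alpha_min)*xi(t).
    High time-space similarity (same speaker) -> slow, stable tracking;
    low similarity (source switch) -> fast re-tracking."""
    alpha_t = alpha_min + (alpha_max - alpha_min) * xi
    return alpha_t * e_prev + (1.0 - alpha_t) * e_now

def spatial_gain(e_tracked, target_db=-20.0, eps=1e-12):
    """Gain driving the tracked energy toward an assumed target level."""
    level_db = 10.0 * np.log10(e_tracked + eps)
    return 10.0 ** ((target_db - level_db) / 20.0)
```

With ξ = 1 the previous estimate dominates (stable tracking); with ξ = 0 the new energy is absorbed quickly, which is the intended behavior on a source switch.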
Step six: adaptive multiband automatic gain compensation
When sound propagates indoors, the low-frequency components, having longer wavelengths, are more prone to diffuse reflection, so their reverberation in the audio data is larger and their energy loss smaller; the high-frequency components, having shorter wavelengths, are more prone to specular reflection, so their reverberation at the microphone is smaller than that of the low frequencies and their energy loss is large. To mitigate this, mel filter banks are used herein,
where M_F(ω) is the filter coefficient of the F-th mel filter bank at frequency ω.
The reverberation levels K(F, τ) of the different mel filter banks are calculated by combining equations (28) and (17).
Based on the relationship between the degree of reverberation and the distance of the sound source from the microphone (the farther the source from the microphone, the higher the reverberation), the low-frequency part of the sound is taken as the reference band, and automatic gain compensation based on the reverberation degree of the reference band is constructed,
where F_sum is the number of frequency bands.
Step seven: adaptive multiband spatial gain smoothing and inverse fourier transform
After spatial automatic gain and multiband automatic gain compensation, the gains of the different frequency bands are obtained. Applying the calculated spatial gain directly, however, would cause sound distortion due to gain discontinuity over time and frequency. To address this, the invention uses the sound source reverberation level instead of an energy comparison to control the update rate of the automatic gain, while smoothing the gain over both the time and frequency dimensions within the frequency range, and synthesizes the gain differently from the DAGC algorithm,
where K_min(t) is the minimum of K(t) tracked via the improved minima-controlled recursive averaging (IMCRA) technique; in equation (33), the gain is reset when the reverberation degree falls below a threshold, reducing gain-tracking errors caused by scattered noise and strong reverberation.
Across frequency, a gain interpolation algorithm is employed.
Finally, the output is produced by the inverse time-frequency transform:
X_out = ISTFT(Gain · X_in)   (35).
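The frequency interpolation of band gains and the reconstruction via the inverse STFT (equations (34)-(35)) can be sketched with SciPy; band frequencies and gain values here are illustrative:

```python
import numpy as np
from scipy.signal import stft, istft

def apply_band_gains(x, band_freqs, band_gains, fs, nperseg=512):
    """Interpolate per-band gains over the STFT frequency bins,
    apply them to the spectrum, and reconstruct the waveform:
    x_out = ISTFT(Gain * X_in)."""
    f, t, X = stft(x, fs=fs, nperseg=nperseg)
    gain_per_bin = np.interp(f, band_freqs, band_gains)  # eq. (34)-style interpolation
    _, y = istft(X * gain_per_bin[:, None], fs=fs, nperseg=nperseg)
    return y
```

With unit gains everywhere, the STFT/ISTFT pair reconstructs the input, so any audible change comes only from the gain curve itself.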
A sound source spatial parameter estimation algorithm based on maximum likelihood estimation under spatial signal-to-noise ratio control is constructed; it estimates the spatial sound source parameters accurately and adaptively adjusts the speed of spatial sound source energy tracking from those parameters, achieving fast and stable tracking of each source's energy and ensuring the accuracy and timeliness of each source's gain calculation. To further reduce the sound distortion caused by reverberation, a band-adaptive gain compensation technique based on reverberation parameters is constructed, reducing the influence of reverberation on spectral distortion of the sound.
To verify the effect of the proposed spatial automatic gain technique with frequency adaptive compensation, FIG. 4 compares the spectra before and after frequency adaptive gain compensation. As the figure shows, the proposed reverberation-based frequency adaptive gain compensation not only mitigates the attenuation of high-frequency energy but also strengthens the direct component of the sound, reducing the influence of reverberation on sound quality.
Furthermore, multiple experiments on real conference-room recordings and simulated data verify the effectiveness of volume equalization in the multi-sound-source switching scene and the improved volume equalization in the reverberant scene.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (8)

1. A multi-sound-source automatic gain control method based on sound field perception, characterized by comprising the following steps:
S1: spatial initialization: set the maximum number of sound sources and divide the 0°–180° plane into a plurality of regions;
initialize the automatic gain for the multiple sound sources;
S2: time-frequency transform: convert the multiple microphone signals into the frequency domain via the short-time Fourier transform;
S3: sound field perception: obtain the signal-to-noise ratio of the sound-source angle in each region;
select the sound source that dominates at each frequency bin;
iteratively solve the sound-source spatial propagation parameters of the different regions at the current moment;
S4: sound-source time-space similarity calculation: compute the inter-frame similarity of the sound source's time-space distribution at the current moment;
S5: spatial automatic gain control: update the energy tracking of each sound source and obtain the spatial automatic gain;
S6: adaptive multiband automatic gain compensation: compute the automatic band compensation gain from the reverberation level of each mel filter bank;
S7: gain smoothing and inverse time-frequency transform: smooth the calculated spatial gain, apply it to the spectrum, and obtain the processed audio via the inverse short-time Fourier transform.
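The step order of claim 1 can be illustrated with the region layout of step S1. A minimal sketch, assuming equal-width angular regions; the function names (`region_centers`, `region_of`) are illustrative, not from the patent:

```python
def region_centers(num_regions: int = 6, span_deg: float = 180.0):
    """S1: divide the 0-180 degree plane into equal regions and
    return the centre angle of each region (eta in claim 2)."""
    width = span_deg / num_regions
    return [width / 2 + k * width for k in range(num_regions)]

def region_of(angle_deg: float, num_regions: int = 6, span_deg: float = 180.0) -> int:
    """Map a source angle in [0, 180) degrees to its region index."""
    width = span_deg / num_regions
    return min(int(angle_deg // width), num_regions - 1)
```

With the default six regions this reproduces the initialization η = {15, 45, 75, 105, 135, 165} given in claim 2.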
2. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S1: the maximum number of sound sources is set, and the 0°–180° plane is divided into 6 regions of 30° each, with the centre angle of each region initialized as η = {15, 45, 75, 105, 135, 165};
multi-sound-source automatic gain initialization: the automatic-gain-control energy smoothing factors α_min and α_max and the target energy level are initialized.
3. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S2: the convolution model is widely used for sound propagation in an enclosed space; its mathematical form is:
x_i(t) = Σ_{j=1}^{N} h_{ij}(t) * s_j(t)
where: x_i(t) is the audio signal received by the i-th microphone at time t; s_j(t) is sound source j; N is the number of sound sources; h_{ij}(t) is the transfer function from sound source j to microphone i;
the microphone-array signal x_i(t) is converted to its frequency-domain form X_i(ω, t); the model is then specialized to the dual-microphone case, and taking the first microphone as the reference one obtains:
X_2(ω, t) = a·e^{−jωδ}·X_1(ω, t)
where a and δ are the relative propagation gain and delay of the sound source between the two microphones.
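The dual-microphone frequency-domain relation can be checked numerically. Assuming, consistently with claim 4, a relative gain a and delay δ between the two microphones, the delay can be read back from the phase of the ratio X2/X1. A hypothetical sketch, not the patent's estimator:

```python
import cmath

def mic2_spectrum(x1: complex, omega: float, a: float, delta: float) -> complex:
    """Dual-mic model: X2(w, t) = a * exp(-j*w*delta) * X1(w, t)."""
    return a * cmath.exp(-1j * omega * delta) * x1

def delay_from_ratio(x1: complex, x2: complex, omega: float) -> float:
    """Recover delta from the phase of X2/X1 (valid while |w*delta| < pi)."""
    return -cmath.phase(x2 / x1) / omega
```

For a synthetic bin with a = 1 the recovered δ matches the one used to build X2, illustrating why the spatial information survives in the phase of the inter-microphone ratio.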
4. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S3: assuming the propagation parameter a of each sound source is the constant 1, the spatial information of the sound source is reflected in the propagation parameter δ, and δ is solved iteratively,
where γ denotes a forgetting factor and β denotes the update speed of the sound-source spatial parameter.
5. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S4: based on formula (23), the region containing the currently dominant sound source is determined;
then the cross-correlation between the spatial distribution vectors of the sound source at the current and previous moments is calculated, yielding the time-space similarity factor ξ, where:
if DominateSource(t) ≠ DominateSource(t−1), then h_prob(t) = 0 and h_cnt(t) = 0.
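The inter-frame similarity of claims 1 (S4) and 5 can be sketched as a normalized cross-correlation between the spatial distribution vectors of consecutive frames, with the tracking state reset when the dominant region changes. The function names and the exact correlation form are assumptions:

```python
import math

def similarity(prev: list, curr: list) -> float:
    """Normalised cross-correlation of two spatial distribution vectors
    (a stand-in for the time-space similarity factor xi)."""
    dot = sum(p * c for p, c in zip(prev, curr))
    norm = math.sqrt(sum(p * p for p in prev)) * math.sqrt(sum(c * c for c in curr))
    return dot / norm if norm > 0.0 else 0.0

def update_state(prev_dom: int, curr_dom: int, h_prob: float, h_cnt: int):
    """Reset the tracking state when the dominant source region changes,
    as stated in claim 5: h_prob(t) = 0, h_cnt(t) = 0."""
    if curr_dom != prev_dom:
        return 0.0, 0
    return h_prob, h_cnt
```

Identical distributions give ξ = 1 (stable scene), orthogonal distributions give ξ = 0 (a source switch), which is exactly the signal the adaptive smoothing in claim 6 consumes.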
6. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S5: an adaptive energy smoothing algorithm based on the time-space similarity is designed, where α(t) is the dynamic energy time-smoothing factor and α_LT is a fixed long-term energy tracking factor:
α(t) = α_min + (α_max − α_min)·ξ(t)    formula (26)
from which the gain at time t can be obtained.
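Formula (26) maps the similarity factor onto a smoothing coefficient. The energy recursion E(t) = α(t)·E(t−1) + (1 − α(t))·P(t) and the square-root gain rule below are assumptions consistent with a standard first-order tracker, not taken verbatim from the patent:

```python
import math

def dynamic_alpha(alpha_min: float, alpha_max: float, xi: float) -> float:
    """Formula (26): alpha(t) = alpha_min + (alpha_max - alpha_min) * xi(t).
    High similarity -> large alpha -> slow, stable tracking;
    low similarity (source switch) -> small alpha -> fast tracking."""
    return alpha_min + (alpha_max - alpha_min) * xi

def track_energy(e_prev: float, power: float, alpha: float) -> float:
    """Assumed first-order recursive energy tracker."""
    return alpha * e_prev + (1.0 - alpha) * power

def spatial_gain(e_target: float, e_tracked: float) -> float:
    """Gain that drives the tracked energy toward the target level."""
    return math.sqrt(e_target / e_tracked) if e_tracked > 0.0 else 1.0
```

This makes the trade-off of claim 6 concrete: when a new source appears, ξ drops, α(t) falls toward α_min, and the tracker converges on the new source's energy within a few frames.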
7. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S6: when sound propagates indoors, the low-frequency components, having longer wavelengths, are more prone to diffuse reflection, so their reverberation in the audio data is larger and their energy loss is smaller; the high-frequency components, having shorter wavelengths, are more prone to specular reflection, so their reverberation at the microphone is smaller than that of the low-frequency components and their energy loss is larger; to mitigate this phenomenon, mel filter banks are used,
where M_F(ω) is the filter coefficient of the F-th mel filter bank at frequency ω,
and the reverberation levels K(F, τ) of the different mel filter banks are calculated;
based on the relationship between the degree of reverberation and the distance from the sound source to the microphone (the farther the sound source is from the microphone, the higher the degree of reverberation), and taking the low-frequency part of the sound as the reference band, automatic gain compensation based on the reverberation degree of the reference band is constructed,
where F_sum is the number of frequency bands.
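The mel scale underlying the filter banks of claim 7, and a compensation gain relative to the low-frequency reference band, can be sketched as follows. The HTK-style mel conversion and the ratio-with-clamping compensation rule are assumptions; the patent does not specify its mel variant or the exact form of the compensation:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Common (HTK-style) mel conversion - an assumed variant."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def band_compensation(k_ref: float, k_band: float, max_gain: float = 4.0) -> float:
    """Assumed compensation: boost a band whose reverberation level K(F, tau)
    indicates more energy loss than the low-frequency reference band,
    clamped to [1, max_gain] so quiet bands are never attenuated or over-boosted."""
    if k_band <= 0.0:
        return 1.0
    return min(max(k_ref / k_band, 1.0), max_gain)
```

The clamp mirrors the claim's intent: high-frequency bands, which lose more energy, receive gain; the reference band itself receives unity gain.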
8. The sound-field-perception-based multi-sound-source automatic gain control method according to claim 1, characterized in that, in step S7:
the calculated spatial gain is smoothed to control the update speed of the automatic gain; the gain over the frequency range is smoothed in both the time and frequency dimensions,
where K_min(t) is the tracked minimum of K(t), obtained by an improved minima-controlled recursive averaging technique;
in formula (33), the gain is reset when the reverberation degree falls below a certain threshold, which reduces gain-tracking errors caused by diffuse noise and strong reverberation;
in frequency, a gain interpolation algorithm is employed;
finally, the output is obtained via the inverse time-frequency transform:
X_out = ISTFT(Gain·X_in)    formula (35).
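Before formula (35), the per-band gains must be spread back over the STFT bins. A simple linear interpolation between band centres, followed by the bin-wise gain application of formula (35), could look like this; the interpolation scheme is an assumption, since the claim only states that a gain interpolation algorithm is used:

```python
def interpolate_gains(band_centers, band_gains, num_bins):
    """Linearly interpolate per-band gains onto num_bins frequency bins."""
    gains = []
    for k in range(num_bins):
        if k <= band_centers[0]:
            gains.append(band_gains[0])
        elif k >= band_centers[-1]:
            gains.append(band_gains[-1])
        else:
            # find the surrounding pair of band centres
            for b in range(len(band_centers) - 1):
                lo, hi = band_centers[b], band_centers[b + 1]
                if lo <= k <= hi:
                    w = (k - lo) / (hi - lo)
                    gains.append((1 - w) * band_gains[b] + w * band_gains[b + 1])
                    break
    return gains

def apply_gain(spectrum, gains):
    """Formula (35) before the ISTFT: X_out = Gain * X_in, bin by bin."""
    return [g * x for g, x in zip(gains, spectrum)]
```

Interpolating avoids staircase discontinuities at band edges, which would otherwise produce audible artifacts after the inverse STFT.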
CN202310648272.1A 2023-06-02 2023-06-02 Multi-sound source automatic gain control method based on sound field perception Withdrawn CN116543784A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310648272.1A CN116543784A (en) 2023-06-02 2023-06-02 Multi-sound source automatic gain control method based on sound field perception


Publications (1)

Publication Number Publication Date
CN116543784A true CN116543784A (en) 2023-08-04

Family

ID=87447112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310648272.1A Withdrawn CN116543784A (en) 2023-06-02 2023-06-02 Multi-sound source automatic gain control method based on sound field perception

Country Status (1)

Country Link
CN (1) CN116543784A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20230804