CN116110419A - Intelligent conference audio processing method and system for self-adaptive beam shaping - Google Patents
- Publication number
- CN116110419A (application number CN202310012812.7A)
- Authority
- CN
- China
- Prior art keywords
- conference
- audio signal
- echo cancellation
- beam shaping
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/40—Engine management systems
Abstract
The invention relates to the technical field of conference audio processing and discloses an intelligent conference audio processing method and system based on adaptive beam shaping (beamforming). The method comprises: constructing an adaptive beam shaping model from the collected audio signals and solving it by optimization; inputting the collected audio signals into the optimal adaptive beam shaping model to obtain the shaped, denoised conference audio signal; constructing and optimizing a multi-conference echo cancellation model; and inputting the shaped, denoised conference audio signal into the optimal conference echo cancellation model to obtain the enhanced, optimized conference audio signal. The adaptive beam shaping model determines the sound source direction in real time, strengthening the audio signal of the microphone facing the sound source and attenuating the signals of the microphones in other directions, so that noise is suppressed without weakening the desired signal; the model that performs echo cancellation on the current moment's audio signal is obtained by a fast solution based on the previous moment's audio signal.
Description
Technical Field
The invention relates to the technical field of audio signal processing, and in particular to an intelligent conference audio processing method and system based on adaptive beam shaping.
Background
Conference systems often rely on single-microphone enhancement to suppress noise, but in a noisy environment the noise-rejection performance of a single microphone degrades sharply. The signal received by a single microphone then typically contains two or more sound sources together with noise generated in the surrounding environment. In practice the desired sound source also moves, and sound waves are reflected and reverberate in the enclosed room, which further lowers the signal-to-noise ratio of the received signal and degrades communication quality. Conventional single-microphone systems use spectral subtraction and filtering techniques to enhance the desired speech signal; however, the source signals and the noise in the received mixture generally overlap, so it is very difficult for conventional speech enhancement methods to enhance the desired sound while effectively suppressing noise, echo and reverberation interference.
Disclosure of Invention
In view of this, the present invention provides an intelligent conference audio processing method based on adaptive beam shaping, which aims to: 1) replace the traditional single-microphone conference setup with an audio processing method for a multi-microphone conference scene; the multi-microphone scene significantly strengthens the audio signal, but also raises the proportion of noise and echo in it, so an adaptive beam shaping model is used to determine the sound source direction in real time, and the adaptive beam shaping layer is updated in real time according to the direction determined from the inter-microphone phase differences, strengthening the signal of the microphone facing the source and attenuating the signals of the microphones in other directions; the multi-microphone audio signals are thereby shaped into a single-microphone audio signal while the noise picked up by the other microphones is suppressed and the desired signal strength is preserved; 2) exploit the phase differences between the audio signals received by different microphones in the multi-microphone scene: the sound source direction is obtained by maximizing a real-part phase-difference criterion, and adaptive beam shaping of the multi-microphone audio signals is performed on the basis of that direction; 3) use the expected-audio-signal calculation layer of the multi-conference echo cancellation model to receive the shaped conference audio signal of the previous period, remove its noise and echo components by signal transformation processing, and take the cleaned signal as the expected audio signal of the current period; the echo cancellation layer optimizes its signal filter from the expected audio signal and the shaped conference audio signal of the previous period, so the optimized layer can perform fast echo cancellation on the shaped conference audio signal of the current period, quickly yielding the enhanced conference audio signal and enabling intelligent real-time processing of the conference-room audio.
In order to achieve the above purpose, the invention provides an intelligent conference audio processing method of adaptive beam shaping, which comprises the following steps:
S1: arrange a plurality of microphones in a conference room, collect conference audio signals during the conference, and construct an adaptive beam shaping model based on the collected audio signals, wherein the model takes the noise-containing conference audio signals as input and the shaped, denoised conference audio signal as output;
S2: solve the constructed adaptive beam shaping model by optimization to obtain the optimal adaptive beam shaping model, limited-memory fast (quasi-Newton) optimization being the main method of the model parameter optimization;
S3: input the collected audio signals into the optimal adaptive beam shaping model to obtain the shaped, denoised conference audio signal;
S4: construct a multi-conference echo cancellation model, wherein the model takes the shaped, denoised audio signal as input and the audio signal after echo cancellation as output;
S5: optimize the constructed multi-conference echo cancellation model to obtain the optimal multi-conference echo cancellation model, a batch least-mean-square optimization algorithm being the main optimization method;
S6: input the shaped, denoised conference audio signal into the optimal conference echo cancellation model to obtain the enhanced, optimized conference audio signal.
As a further improvement of the present invention:
optionally, the step S1 of deploying a plurality of microphones in a conference room and collecting conference audio signals during the conference includes:
deploy m microphones in the conference room to form a microphone array; the array is a circular structure of radius R, the angle between adjacent microphones is β, and the distance between adjacent microphones is d, where each adjacent-microphone angle takes the circle centre as its vertex and d is the straight-line distance between adjacent microphones;
a sound source in the conference room emits audio signals, and every microphone in the array receives a conference audio signal comprising the audio emitted by the source, an ambient noise signal and an echo signal; the sound source lies in the region outside the array;
number the m microphones in clockwise order, starting from the microphone in the due-east direction; the conference audio signal received by the i-th microphone is x_i(t) = s_i(t) + n_i(t), where s_i(t) is the sound-source component received by the i-th microphone (the audio emitted by the source together with its echo), n_i(t) is the ambient noise received by the i-th microphone, and t denotes the time-domain information; the conference audio signal received by the adjacent (i-1)-th microphone is then:
x_{i-1}(t) = s_i(t − τ_{i,i-1}(θ)) + n_{i-1}(t)
wherein:
τ_{i,i-1}(θ) represents the delay with which the (i-1)-th microphone receives the conference audio signal relative to the i-th microphone;
c represents the sound velocity, and θ represents the sound source direction, the parameter to be solved;
construct a coordinate system centred on the circle centre of the array, with the east-west direction as the longitudinal axis and the north-south direction as the transverse axis; this yields the position angle of each microphone in the array, the sound source direction being the angle between the positive half of the longitudinal axis and the line joining the source position to the circle centre;
the angle between the i-th microphone and the positive half of the longitudinal axis is (i-1)β, which represents the direction of the i-th microphone. In the embodiment of the invention, the signals collected in the conference room are conference audio signals, which can be decomposed into the source audio, ambient noise and echo components on the basis of audio synthesis theory.
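The array geometry above can be sketched numerically. A minimal Python illustration, assuming a far-field plane-wave delay model for τ_{i,i-1}(θ) (the closed-form delay expression is not reproduced in the text, so this is a standard-model assumption) and illustrative names throughout:

```python
import math

C = 343.0  # speed of sound in air (m/s), an assumed constant

def mic_angles(m):
    """Position angles of m microphones on a circle; adjacent angle beta = 2*pi/m.
    Index 0 here corresponds to the patent's microphone 1 (due east)."""
    beta = 2 * math.pi / m
    return [i * beta for i in range(m)]

def adjacent_spacing(m, R):
    """Chord distance d between adjacent microphones on a circle of radius R."""
    beta = 2 * math.pi / m
    return 2 * R * math.sin(beta / 2)

def pair_delay(theta, phi_i, phi_prev, R, c=C):
    """Far-field plane-wave delay of the (i-1)-th mic (angle phi_prev) relative
    to the i-th mic (angle phi_i) for a source in direction theta."""
    return (R / c) * (math.cos(theta - phi_i) - math.cos(theta - phi_prev))
```

For m = 6 and R = 0.1 m the adjacent spacing evaluates to 2R·sin(π/6) = R, and the delay vanishes when both microphones coincide, as expected from the model.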
Optionally, the constructing an adaptive beam shaping model based on the acquired audio signal in the step S1 includes:
and constructing an adaptive beam shaping model based on the acquired audio signals, wherein the adaptive beam shaping model comprises a sound source direction estimation layer and an adaptive beam shaping layer, the sound source direction estimation layer is used for estimating the sound source direction according to the acquired audio signals, the adaptive beam shaping layer is used for screening the audio signals received by the microphones in the corresponding directions according to the sound source direction, and filtering, noise reduction and audio signal enhancement processing are carried out on the screened audio signals by utilizing an adaptive shaping filter.
Optionally, in the step S2, performing optimization solution on the constructed adaptive beam shaping model, including:
the part of the adaptive beam shaping model solved by optimization is the adaptive beam shaping layer; the layer updates the weights of the conference audio signals collected by the microphones in different directions according to the solved sound source direction, the collected conference audio signals are input into the updated layer, and delay compensation plus weighted combination of the signals collected by the differently oriented microphones yields a conference audio signal dominated by the sound-source audio;
The optimization procedure of the adaptive beam shaping layer is as follows:
S21: construct the optimization objective function G(W) of the weight matrix in the adaptive beam shaping layer, wherein:
M represents the row matrix formed by the delay-compensated conference audio signals collected by the m microphones; the delay compensation takes the signal of the microphone closest to the solved sound source direction θ as reference, computes the delay of the conference audio signal collected by each adjacent microphone, and applies the equivalent compensation to the time-domain information of that signal, giving the row matrix of m delay-compensated conference audio signals;
W represents the weight matrix, a row matrix containing m weight values;
the remaining term of the objective is the superposed signal value, after delay compensation, of the conference audio signals collected by the microphones lying in the sound source direction;
S22: randomly generate an initial weight matrix W_0 for the objective function G(W); let the current iteration number of the optimization be k, with initial value k = 1;
S23: update the inverse-Hessian approximation
D_{k+1} = (I − s_k b_k^T / (b_k^T s_k)) D_k (I − b_k s_k^T / (b_k^T s_k)) + s_k s_k^T / (b_k^T s_k)
wherein:
I is the identity matrix;
s_k = c_{k+1} − c_k;
b_k = g_{k+1} − g_k;
c_k is the stationary-point estimate of the objective function after the k-th iteration, with c_0 = W_0;
g_k is the derivative of the objective function at the k-th iteration, the weight matrix being the variable;
D_0 is the identity matrix;
S24: solve for the stationary-point estimate at the (k+1)-th iteration: c_{k+1} = c_k − D_k g_k;
S25: compute the rank and eigenvalues of c_{k+1} and c_k; if the rank and eigenvalues are the same, c_{k+1} is the solved weight matrix; otherwise let k = k+1 and return to step S23.
Optionally, the step S3 inputs the collected audio signal into an optimal adaptive beam shaping model for shaping, including:
The shaping procedure of the collected conference audio signals in the adaptive beam shaping model is as follows:
S31: apply Fourier transform processing to the collected conference audio signals to obtain the Fourier spectra of the signals collected by the different microphones; for the i-th and (i-1)-th microphones,
X_i(ω) = S_i(ω) + N_i(ω)
X_{i-1}(ω) = S_i(ω)e^{−jωτ_{i,i-1}(θ)} + N_{i-1}(ω)
wherein:
ω represents frequency; the horizontal axis of the Fourier spectrum is frequency and the vertical axis the amplitude of the frequency component;
j represents the imaginary unit, j² = −1;
X_i(ω) represents the Fourier-spectrum representation of the conference audio signal collected by the i-th microphone;
S32: since the noise signals in the conference audio signals collected by different microphones are uncorrelated, construct the phase-difference information of the Fourier-spectrum representations of adjacent microphones as the real part of the phase-compensated cross-spectrum, Σ_ω Re{X_i(ω) X_{i-1}*(ω) e^{−jωτ_{i,i-1}(θ)}}, wherein Re(·) represents the real-part operation;
S33: determine the sound source direction from the phase-difference information of the different microphones: the direction θ that maximizes this real-part phase-difference quantity is taken as the solving result;
S34: based on the determined sound source direction, solve the adaptive beam shaping layer of the adaptive beam shaping model by optimization, input the collected conference audio signals into the layer, and let it perform delay compensation and weight-matrix-based weighted combination on the signals collected by the differently oriented microphones to obtain the shaping result of the collected audio signals. In the embodiment of the invention, the shaping result of the conference audio signal consists mainly of the sound-source audio and the echo signal.
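Steps S31 to S33 can be sketched as a GCC-style direction search: pick the candidate direction whose modelled delay best aligns the phase of the adjacent microphones' cross-spectrum. A minimal sketch, assuming a single dominant frequency, an illustrative delay model and assumed constants FS and C:

```python
import math, cmath

FS = 8000.0   # sampling rate in Hz (assumed)
C = 343.0     # speed of sound in m/s (assumed)

def spectrum_at(x, omega, fs=FS):
    """Fourier component X(omega) of a sampled signal x."""
    return sum(v * cmath.exp(-1j * omega * n / fs) for n, v in enumerate(x))

def estimate_doa(x_i, x_prev, omega, delay_model, thetas):
    """Choose the candidate direction theta that maximises the real part of
    the phase-compensated cross-spectrum of two adjacent microphones
    (steps S32-S33); delay_model(theta) supplies the inter-mic delay."""
    Xi = spectrum_at(x_i, omega)
    Xp = spectrum_at(x_prev, omega)
    def score(theta):
        comp = cmath.exp(-1j * omega * delay_model(theta))
        return (Xi * Xp.conjugate() * comp).real
    return max(thetas, key=score)
```

With the source direction in hand, step S34's delay compensation and weighted combination reduces to shifting each microphone's signal by its modelled delay and summing with the weight matrix W.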
Optionally, the constructing a multi-conference echo cancellation model in the step S4 includes:
construct a multi-conference echo cancellation model comprising an expected-audio-signal calculation layer and an echo cancellation layer; the expected-audio-signal calculation layer receives the shaped conference audio signal of the previous period, removes its noise and echo components by signal transformation processing, and takes the cleaned audio signal as the expected audio signal of the current period; the echo cancellation layer is structured as a signal filter, which it optimizes based on the expected audio signal and the shaped conference audio signal of the previous period; the optimized echo cancellation layer then performs fast echo cancellation on the shaped conference audio signal of the current period, quickly yielding the enhanced conference audio signal;
The calculation of the previous period's expected audio signal in the expected-audio-signal calculation layer is as follows:
S41: input the shaped conference audio signal x′(t′) of the previous period into the expected-audio-signal calculation layer, where t′ represents the time-domain information of the previous period;
S42: calculate the transform of the shaped conference audio signal at the different decomposition scales, wherein:
δ(t) represents the signal decomposition function; in the embodiment of the invention the selected decomposition function is a dbN wavelet function;
a represents the decomposition scale; in the embodiment of the invention a ∈ [1, 4];
d(a) represents the decomposition coefficient at decomposition scale a;
S43: delete the decomposition coefficients whose value is smaller than the preset threshold and reconstruct an audio signal from the remaining coefficients, wherein:
d′(a) represents the set of retained decomposition coefficients;
x″(t′) represents the audio signal reconstructed from the retained decomposition coefficients;
the reconstructed audio signal x″(t′) is taken as the expected audio signal of the previous period.
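Steps S42 and S43 amount to wavelet decomposition, coefficient thresholding and reconstruction. A dependency-free sketch, substituting a Haar wavelet for the patent's dbN choice (an assumption made only to avoid external libraries; in practice a library such as PyWavelets with a dbN wavelet would be used):

```python
def haar_dwt(x):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    a = [(x[2*i] + x[2*i+1]) / 2 ** 0.5 for i in range(len(x) // 2)]
    d = [(x[2*i] - x[2*i+1]) / 2 ** 0.5 for i in range(len(x) // 2)]
    return a, d

def haar_idwt(a, d):
    """Inverse of haar_dwt."""
    x = []
    for ai, di in zip(a, d):
        x.append((ai + di) / 2 ** 0.5)
        x.append((ai - di) / 2 ** 0.5)
    return x

def wavelet_denoise(x, levels=4, thresh=0.1):
    """Decompose over scales a = 1..levels (S42), delete detail coefficients
    below the threshold while keeping the rest (S43), and reconstruct the
    expected audio signal."""
    approx, details = list(x), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        approx, d = haar_dwt(approx)
        details.append(d)
    for d in details:
        for i, v in enumerate(d):
            if abs(v) < thresh:
                d[i] = 0.0  # small coefficient: treated as noise, deleted
    for d in reversed(details):
        approx = haar_idwt(approx, d)
    return approx
```

Applied to a near-constant signal with one small spike, the thresholding removes the spike's detail coefficients and the reconstruction returns a flat signal with the same mean.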
Optionally, the optimizing the constructed multi-conference echo cancellation model in the step S5 includes:
the multi-conference echo cancellation model receives the shaped conference audio signal and the expected audio signal of the previous period and optimizes the signal filter in the echo cancellation layer, obtaining an echo cancellation layer usable for echo cancellation of the current period's shaped conference audio signal; the optimization procedure of the echo cancellation layer is as follows:
S51: acquire the previous period's shaped conference audio signal x′(t′), expected audio signal x″(t′) and the echo-cancellation-layer coefficients; the coefficients form a weight vector of the same length as the audio signal, and the finite convolution of the coefficients with the shaped conference audio signal is the audio signal after echo cancellation;
S52: let the echo-cancellation-layer coefficients of the previous period be L(0); the coefficients obtained at the h-th iteration are then L(h);
S53: calculate the echo-cancelled audio signal of the h-th iteration,
f(h) = L(h) ∗ x′(t′) (finite convolution)
and the iteration error of the h-th iteration, e(h) = x″(t′) − f(h); if |e(h)| is smaller than the threshold, L(h) is the echo-cancellation-layer coefficient vector obtained by the optimization;
S54: design the variable step factor for the echo-cancellation coefficient update:
μ(h) = ln(1 + |e(h)|^s)
wherein s is the step-shaping coefficient, set to 0.5;
S55: determine the update term based on the least-mean-square optimization algorithm;
S56: obtain the echo-cancellation-layer coefficients L(h+1) of the (h+1)-th iteration from the update term and the variable step factor, let h = h + 1, and return to step S53;
the optimal conference echo cancellation model is constructed from the optimized solving result for the previous moment's audio signal.
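Steps S51 to S56 describe an adaptive filter trained with the variable step factor μ(h) = ln(1 + |e(h)|^s). A minimal sketch, assuming a plain normalised-LMS coefficient update for the S55/S56 update term (the patent only names the least-mean-square algorithm, so the exact rule is an assumption) and illustrative names throughout:

```python
import math

def adapt_echo_filter(x, desired, taps=4, s=0.5, tol=1e-3, max_epochs=200):
    """Variable-step LMS sketch of S51-S56: f = L * x (finite convolution),
    e = desired - f, step mu = ln(1 + |e|^s), normalised LMS update."""
    L = [0.0] * taps
    for _ in range(max_epochs):
        max_err = 0.0
        for n in range(len(x)):
            # finite convolution of the coefficients with the input signal
            window = [x[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
            f = sum(L[k] * window[k] for k in range(taps))
            e = desired[n] - f
            max_err = max(max_err, abs(e))
            mu = math.log(1.0 + abs(e) ** s)       # variable step factor (S54)
            norm = sum(w * w for w in window) + 1e-8  # normalisation for stability
            for k in range(taps):                   # LMS update (S55-S56)
                L[k] += mu * e * window[k] / norm
        if max_err < tol:                           # error below threshold (S53)
            break
    return L
```

Trained on a signal filtered through a known 2-tap echo path, the sketch recovers the path coefficients, i.e. the filter that cancels the echo.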
Optionally, in the step S6, the shaped denoising conference audio signal is input into an optimal conference echo cancellation model to perform echo cancellation processing, including:
and inputting the shaped denoising conference audio signal at the current moment into an optimal conference echo cancellation model for echo cancellation processing, and obtaining and playing the conference audio signal finally enhanced at the current moment.
To solve the above problems, the present invention provides an adaptive beam shaping intelligent conference audio processing system, the system comprising:
an audio signal acquisition device for deploying a plurality of microphones in a conference room and collecting conference audio signals during a conference;
a beam shaping device for constructing an adaptive beam shaping model from the collected audio signals, solving it by optimization to obtain the optimal adaptive beam shaping model, and inputting the collected noise-containing audio signals into the solved model to obtain the shaped, denoised conference audio signal;
an echo cancellation module for optimizing the constructed multi-conference echo cancellation model to obtain the optimal model parameters, and inputting the shaped, denoised audio signal into the optimized multi-conference echo cancellation model to obtain the enhanced, optimized conference audio signal.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the above adaptive beam shaping intelligent conference audio processing method.
In order to solve the above-mentioned problems, the present invention also provides a computer-readable storage medium having stored therein at least one instruction that is executed by a processor in an electronic device to implement the above-mentioned adaptive beam-shaping intelligent conference audio processing method.
Compared with the prior art, the invention provides an intelligent conference audio processing method for self-adaptive beam shaping, which has the following advantages:
Firstly, the scheme proposes an adaptive beam shaping model. The collected conference audio signals are Fourier-transformed to obtain the spectrum X_i(ω) of each microphone's signal; because the noise signals collected by different microphones are uncorrelated, phase-difference information is constructed from the real part of the cross-spectrum of adjacent microphones, and the direction θ that maximizes this real-part phase-difference criterion is taken as the sound source direction. Based on the determined direction, the adaptive beam shaping layer is solved by optimization, and delay compensation plus weight-matrix-based weighted combination of the signals collected by the differently oriented microphones yields the shaping result. The scheme thus determines the source direction in real time, strengthens the signal of the microphone facing the source and attenuates the signals of the other microphones, shaping the multi-microphone audio signals into a single-microphone audio signal while suppressing the noise picked up by the other microphones and preserving the desired signal strength, and thereby achieves conference audio enhancement in a multi-microphone scene.
Therefore, the scheme provides a multi-conference echo cancellation method, the multi-conference echo cancellation model receives the shaped conference audio signal and the expected audio signal of the previous period, optimizes a signal filter in an echo cancellation layer to obtain the echo cancellation layer which can be used for echo cancellation of the shaped conference audio signal of the current period, and the optimization flow of the echo cancellation layer is as follows: acquiring a shaped conference audio signal x '(t'), an expected audio signal x '(t') and an echo cancellation layer coefficient of the last period, wherein the echo cancellation layer coefficient is a weight vector with the same length as the audio signal, and the limited convolution sum result of the echo cancellation layer coefficient and the shaped conference audio signal is the audio signal after echo cancellation; let the echo cancellation layer coefficient of the previous period be L (0), then the echo cancellation layer coefficient obtained by the h iteration is L (h); calculating an echo cancellation audio signal f (h) based on the h iteration result:
f(h)=L(h)x′(t′)
Calculate the iteration error of the h-th iteration: e(h) = x″(t′) − f(h). If e(h) is smaller than the threshold value, L(h) is the echo cancellation layer coefficient vector obtained by the optimization solution; otherwise, design a variable step factor for the echo cancellation coefficient update:
μ(h) = ln(1 + |e(h)|^s)
wherein: s is the variable step-length coefficient, set to 0.5; the variable step factor is designed on the basis of a sigmoid-like function so that an available parameter is solved quickly. The update coefficients are determined based on the least-mean-square optimization algorithm, and the echo cancellation layer coefficients L(h+1) of the (h+1)-th iteration are obtained from the update coefficients and the variable step factor:
Let h = h + 1 and return to the step above; the optimal conference echo cancellation model is constructed based on the optimal solution result for the audio signal at the previous moment. The shaped denoised conference audio signal at the current moment is then input into the optimal conference echo cancellation model for echo cancellation processing, and the finally enhanced conference audio signal at the current moment is obtained and played. In the traditional scheme, multi-step signal transformation must be calculated on the current audio signal to cancel echo, and the large amount of calculation prevents real-time output of the echo-cancelled audio signal. Analysis of the conference-room scene shows that abnormal sounds rarely appear suddenly. This application therefore uses the expected audio signal calculation layer in the multi-conference echo cancellation model to receive the shaped conference audio signal of the previous period, removes its noise and echo components on the basis of signal transformation processing, and takes the audio signal with the noise and echo removed as the expected audio signal of the current period, so that the filter parameters of the current moment are obtained by optimizing on the audio signals of the previous moment; the echo cancellation layer optimizes its signal filter based on the expected audio signal of the previous period and the shaped conference audio signal of the previous period. On this basis, fast echo cancellation processing can be performed on the shaped conference audio signal of the current period with the optimized echo cancellation layer, the enhanced conference audio signal is obtained quickly, and intelligent real-time processing of the audio signals in the conference room is achieved.
Drawings
Fig. 1 is a flow chart of an intelligent conference audio processing method with adaptive beam shaping according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of an adaptive beam-shaping intelligent conference audio processing system according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing an intelligent conference audio processing method for adaptive beam shaping according to an embodiment of the present invention.
The objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides an intelligent conference audio processing method for adaptive beam shaping. The execution subject of the adaptive beam shaping intelligent conference audio processing method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the adaptive beam-shaping intelligent conference audio processing method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The service end includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Example 1:
S1: a plurality of microphones are deployed in a conference room and conference audio signals are collected during the conference; an adaptive beam shaping model is built based on the collected audio signals, the adaptive beam shaping model taking the noise-containing conference audio signals as input and the shaped denoised conference audio signals as output.
The step S1 of disposing a plurality of microphones in a conference room and collecting conference audio signals during the conference includes:
disposing m microphones in a conference room to form a microphone array, wherein the microphone array has a circular structure of radius R, the included angle between adjacent microphones is β, and the distance between adjacent microphones is d; the included angles of adjacent microphones all take the circle center of the circular structure as the vertex, and the distance between adjacent microphones is the straight-line distance between them;
the method comprises the steps that a sound source in a conference room sends out audio signals, different microphones in a microphone array all receive the conference audio signals, the conference audio signals comprise audio signals sent out by the sound source, environment noise signals and echo signals, and the position of the sound source is an external area of the microphone array;
numbering the m microphones in the microphone array in clockwise order, taking the position of the microphone in the due-east direction as the starting point, the conference audio signal received by the i-th microphone in the microphone array is x_i(t) = s_i(t) + n_i(t), where s_i(t) is the sound-source audio received by the i-th microphone in the microphone array, including the audio signal emitted by the sound source and the echo signal, n_i(t) is the ambient noise signal received by the i-th microphone in the microphone array, and t represents time-domain information; the conference audio signal received by the adjacent (i-1)-th microphone is then:
x_{i-1}(t) = s_i(t − τ_{i,i-1}) + n_{i-1}(t)
wherein:
τ_{i,i-1}(θ) represents the time delay with which the (i-1)-th microphone receives the conference audio signal compared with the i-th microphone;
c represents sound velocity, θ represents sound source direction, and is a parameter to be solved;
a coordinate system is constructed with the circle center of the microphone array structure as the origin, the west-east direction as the longitudinal axis and the north-south direction as the transverse axis, giving the position angles of the different microphones in the microphone array; the sound source direction is the angle between the line connecting the sound source position to the circle center and the positive half of the longitudinal axis;
wherein the angle between the i-th microphone and the positive half of the longitudinal axis is (i-1)β, which represents the direction of the i-th microphone.
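As a concrete reading of the array geometry above, the following sketch computes an inter-microphone delay τ_{i,i-1}(θ) for the circular array. The function names and the far-field plane-wave model (arrival time proportional to cos(θ − φ)) are assumptions for illustration; the patent's exact closed form for the delay is not reproduced in the text.

```python
import math

def mic_angle(i, beta):
    """Position angle of the i-th microphone (1-indexed) measured from the
    positive half of the longitudinal axis, as stated above: (i-1)*beta."""
    return (i - 1) * beta

def pair_delay(i, theta, R, beta, c=343.0):
    """Sketch of tau_{i,i-1}(theta): extra arrival time at microphone i-1
    relative to microphone i for a wave from direction theta.
    Assumption: far-field plane wave, so each microphone's arrival time is
    -(R/c)*cos(theta - phi) with phi its position angle."""
    t_i = -(R / c) * math.cos(theta - mic_angle(i, beta))
    t_prev = -(R / c) * math.cos(theta - mic_angle(i - 1, beta))
    return t_prev - t_i
```

Under this model the delay vanishes when θ bisects the two microphone angles and is bounded in magnitude by d/c, with d = 2R·sin(β/2) the straight-line spacing named in the text.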
In the step S1, an adaptive beam shaping model is constructed based on the collected audio signal, which includes:
and constructing an adaptive beam shaping model based on the acquired audio signals, wherein the adaptive beam shaping model comprises a sound source direction estimation layer and an adaptive beam shaping layer, the sound source direction estimation layer is used for estimating the sound source direction according to the acquired audio signals, the adaptive beam shaping layer is used for screening the audio signals received by the microphones in the corresponding directions according to the sound source direction, and filtering, noise reduction and audio signal enhancement processing are carried out on the screened audio signals by utilizing an adaptive shaping filter.
S2: the constructed self-adaptive beam shaping model is optimized and solved to obtain the optimal self-adaptive beam shaping model, where limited-memory rapid optimization is the main implementation method of the model parameter optimization.
And in the step S2, the constructed self-adaptive beam shaping model is optimized and solved, and the method comprises the following steps:
the part of the adaptive beam shaping model to be optimized and solved is the adaptive beam shaping layer. The adaptive beam shaping layer optimally updates the weights of the conference audio signals acquired by microphones in different directions according to the solved sound source direction; the acquired conference audio signals are input into the updated adaptive beam shaping layer, and a conference audio signal result dominated by the sound-source audio is obtained by performing delay compensation and weighted combination on the conference audio signals acquired by the microphones in different directions;
the optimization solving flow of the self-adaptive beam shaping layer is as follows:
S21: construct the optimization objective function of the weight matrix in the adaptive beam shaping layer:
wherein:
M represents the row matrix formed by the conference audio signals collected by the m microphones after delay compensation. The delay compensation method is as follows: using the solved sound source direction, take the conference audio signal collected by the microphone closest to that direction as the reference, calculate the time delay of the conference audio signals collected by the adjacent microphones, and apply equivalent compensation for the calculated delay to the time-domain information of the conference audio signals, obtaining the row matrix formed by the m delay-compensated conference audio signals;
W represents a weight matrix comprising m weight values, where W is a row matrix;
the remaining term denotes the superposed signal value, after delay compensation, of the conference audio signals acquired by the microphones within the neighborhood of the solved sound source direction;
S22: for the objective function G(W), randomly generate an initial weight matrix W_0; set the current iteration number of the optimization solution to k, with initial value k = 1;
S23: update the approximate inverse Hessian matrix D_k used to generate the search direction:
D_{k+1} = (I − s_k b_k^T / (b_k^T s_k)) D_k (I − b_k s_k^T / (b_k^T s_k)) + s_k s_k^T / (b_k^T s_k)
Wherein:
I is the identity matrix;
s_k = c_{k+1} − c_k;
b_k = g_{k+1} − g_k;
c_k is the stationary point of the objective function after the k-th iteration, with c_0 = W_0;
g_k is the derivative of the objective function at the k-th iteration, in which the weight matrix is the variable;
D_0 is the identity matrix;
S24: solve the stationary point c_{k+1} at the (k+1)-th iteration:
c_{k+1} = c_k − D_k g_k
S25: calculate the rank and eigenvalues of c_{k+1} and c_k; if the rank and eigenvalues of c_{k+1} and c_k are the same, c_{k+1} is taken as the optimal weight matrix solution; otherwise let k = k + 1 and return to step S23.
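The S22-S25 loop can be sketched as a small BFGS-style quasi-Newton iteration. Several elements are assumptions for illustration: the patent's objective G(W) is not reproduced, so a two-dimensional quadratic stand-in is used; convergence is tested with a gradient-norm check instead of the rank-and-eigenvalue comparison; and a backtracking line search replaces the unspecified step rule.

```python
def G(w):                               # surrogate objective (assumption)
    return 2 * w[0] ** 2 + w[1] ** 2 - 2 * w[0] * w[1] - 2 * w[1]

def grad(w):                            # analytic gradient of the surrogate
    return [4 * w[0] - 2 * w[1], -2 * w[0] + 2 * w[1] - 2]

def mat_vec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def bfgs(w, tol=1e-8, max_iter=100):
    D = [[1.0, 0.0], [0.0, 1.0]]        # D_0 is the identity matrix
    g = grad(w)
    for _ in range(max_iter):
        if (g[0] ** 2 + g[1] ** 2) ** 0.5 < tol:
            break                        # stationary point reached
        p = [-v for v in mat_vec(D, g)]  # quasi-Newton search direction
        gp = g[0] * p[0] + g[1] * p[1]
        alpha = 1.0                      # backtracking (Armijo) line search
        while G([w[0] + alpha * p[0], w[1] + alpha * p[1]]) > G(w) + 1e-4 * alpha * gp:
            alpha *= 0.5
        w_new = [w[0] + alpha * p[0], w[1] + alpha * p[1]]
        g_new = grad(w_new)
        s = [w_new[0] - w[0], w_new[1] - w[1]]   # s_k = c_{k+1} - c_k
        b = [g_new[0] - g[0], g_new[1] - g[1]]   # b_k = g_{k+1} - g_k
        sb = s[0] * b[0] + s[1] * b[1]
        if abs(sb) > 1e-12:              # BFGS inverse-Hessian update of D
            Db = mat_vec(D, b)
            bDb = b[0] * Db[0] + b[1] * Db[1]
            D = [[D[r][c]
                  + (sb + bDb) * s[r] * s[c] / sb ** 2
                  - (Db[r] * s[c] + s[r] * Db[c]) / sb
                  for c in range(2)] for r in range(2)]
        w, g = w_new, g_new
    return w
```

For the surrogate above the stationary point is (1, 2), and the loop reaches it in a few iterations from the origin; the same s_k/b_k update structure applies to the real weight-matrix objective.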
S3: and inputting the acquired audio signals into an optimal self-adaptive beam shaping model to obtain the shaped denoising conference audio signals.
And S3, inputting the acquired audio signals into an optimal self-adaptive beam shaping model for shaping, wherein the step comprises the following steps:
the shaping flow of the collected conference audio signals based on the adaptive beam shaping model is as follows:
S31: Fourier transform processing is performed on the collected conference audio signals to obtain the Fourier spectra of the conference audio signals collected by the different microphones, where the Fourier spectra of the conference audio signals collected by the i-th and (i-1)-th microphones are expressed as X_i(ω) = ∫ x_i(t)e^{-jωt}dt and X_{i-1}(ω) = ∫ x_{i-1}(t)e^{-jωt}dt;
wherein:
ω represents frequency; the horizontal axis in the Fourier spectrum represents frequency and the vertical axis the amplitude of the frequency component;
j represents the imaginary unit, j^2 = -1;
X_i(ω) represents the Fourier-spectrum representation of the conference audio signal acquired by the i-th microphone;
S32: the noise signals in the conference audio signals collected by different microphones are uncorrelated; the phase-difference information of the Fourier-spectrum representations of the conference audio signals collected by the different microphones is constructed as the real part of their cross-spectrum, Re(X_i(ω)X*_{i-1}(ω)):
wherein:
Re(·) represents the operation of taking the real part;
S33: determining a sound source direction based on phase difference information of different microphones:
wherein:
θ is the sound source direction of the sound in the conference room to be solved; the direction that maximizes the summed real-part phase-difference term is taken as the solving result;
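A simplified reading of S32-S33: maximizing the real part of the cross-spectrum phase term over candidate delays is equivalent, in the time domain, to picking the lag that maximizes the cross-correlation of two microphone channels. The sketch below uses that equivalence with integer-sample lags and no spectral weighting (both assumptions); the winning lag would then map back to a direction θ through the delay model τ_{i,i-1}(θ).

```python
import math

def estimate_delay(x_i, x_prev, max_lag):
    """Return the integer lag in [-max_lag, max_lag] maximising the
    time-domain cross-correlation of the two channels (a stand-in for
    maximising the summed real part of the cross-spectrum term)."""
    n = len(x_i)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        for t in range(n):
            if 0 <= t - lag < n:
                acc += x_i[t] * x_prev[t - lag]
        if acc > best_val:
            best_val, best_lag = acc, lag
    return best_lag
```

On a test signal delayed by a known number of samples, the recovered lag matches the imposed delay, which is the information the sound-source-direction solver needs.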
S34: based on the determined sound source direction, the adaptive beam shaping layer of the adaptive beam shaping model is obtained by optimization solving; the acquired conference audio signals are input into the adaptive beam shaping layer, which performs time delay compensation and weight-matrix-based weighted combination on the conference audio signals acquired by the microphones in different directions, obtaining the shaping result of the acquired audio signals.
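The delay-compensation-plus-weighted-combination step can be sketched as a classic delay-and-sum beamformer. Integer-sample delays and scalar per-channel weights are simplifications (assumptions); the patent works with a solved weight matrix rather than fixed uniform weights.

```python
import math

def delay_and_sum(channels, delays, weights):
    """Delay compensation followed by a weighted combination.
    delays[i] is the known arrival delay (in samples) of channel i
    relative to the reference; compensation advances each channel by its
    delay before the weighted sum."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d, w in zip(channels, delays, weights):
        for t in range(n):
            if 0 <= t + d < n:
                out[t] += w * ch[t + d]
    return out
```

When each channel is the same source signal at a different delay, aligning and averaging reproduces the source while uncorrelated per-channel noise would be attenuated by the averaging.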
S4: a multi-conference echo cancellation model is constructed, where the multi-conference echo cancellation model takes the shaped denoised audio signal as input and the echo-cancelled audio signal as output.
And in the step S4, a multi-conference echo cancellation model is constructed, which comprises the following steps:
a multi-conference echo cancellation model is constructed, comprising an expected audio signal calculation layer and an echo cancellation layer. The expected audio signal calculation layer receives the shaped conference audio signal of the previous period, removes its noise and echo components on the basis of signal conversion processing, and takes the audio signal with the noise and echo removed as the expected audio signal of the current period. The echo cancellation layer is structured as a signal filter; it optimizes this filter based on the expected audio signal of the previous period and the shaped conference audio signal of the previous period, and performs fast echo cancellation processing on the shaped conference audio signal of the current period with the optimized echo cancellation layer, so that the enhanced conference audio signal is obtained quickly;
The calculation flow for the expected audio signal of the previous period in the expected audio signal calculation layer is as follows:
S41: input the shaped conference audio signal x′(t′) of the previous period into the expected audio signal calculation layer, where t′ represents the time-domain information of the previous period;
S42: calculate the transformation results of the shaped conference audio signal at different decomposition scales:
wherein:
δ(t) represents the signal decomposition function; in the embodiment of the invention, the selected signal decomposition function is a dbN wavelet function;
a represents the decomposition scale; in the embodiment of the invention, a ∈ [1, 4];
d(a) represents the decomposition coefficient at decomposition scale a;
S43: delete the decomposition coefficients smaller than the preset threshold and reconstruct the remaining decomposition coefficients into an audio signal:
wherein:
d′(a) represents the set of retained decomposition coefficients;
x″(t′) represents the audio signal reconstructed from the retained decomposition coefficients;
the reconstructed audio signal x″(t′) is taken as the expected audio signal of the previous period.
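The decompose-threshold-reconstruct flow of S42/S43 can be sketched with a single-level Haar transform. The Haar basis (instead of dbN), the single decomposition level, and the even-length input are all simplifying assumptions; the structure of the computation is the same.

```python
def haar_denoise(x, threshold):
    """One-level Haar analysis, hard thresholding of the detail
    coefficients, and synthesis: a minimal stand-in for the dbN
    decomposition, coefficient deletion and reconstruction above."""
    half = len(x) // 2
    approx = [(x[2 * k] + x[2 * k + 1]) / 2.0 for k in range(half)]
    detail = [(x[2 * k] - x[2 * k + 1]) / 2.0 for k in range(half)]
    # delete decomposition coefficients below the preset threshold
    detail = [d if abs(d) >= threshold else 0.0 for d in detail]
    out = []
    for a, d in zip(approx, detail):
        out.extend([a + d, a - d])      # exact inverse when nothing is cut
    return out
```

Small high-frequency wiggles fall below the threshold and are smoothed away, while larger detail coefficients survive, so the slowly varying structure of the signal is preserved.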
S5: the constructed multi-conference echo cancellation model is optimized to obtain the optimal multi-conference echo cancellation model, where a batch least-mean-square optimization algorithm is the main method for optimizing the multi-conference echo cancellation model.
And in the step S5, optimizing the constructed multi-conference echo cancellation model, which comprises the following steps:
the multi-conference echo cancellation model receives the shaped conference audio signal and the expected audio signal of the previous period and optimizes the signal filter in the echo cancellation layer, obtaining an echo cancellation layer that can be used for echo cancellation of the shaped conference audio signal of the current period. The optimization flow of the echo cancellation layer is as follows:
S51: acquire the shaped conference audio signal x′(t′), the expected audio signal x″(t′) and the echo cancellation layer coefficients of the previous period, where the echo cancellation layer coefficients form a weight vector of the same length as the audio signal, and the finite convolution of the echo cancellation layer coefficients with the shaped conference audio signal is the echo-cancelled audio signal;
S52: let the echo cancellation layer coefficients of the previous period be L(0); the coefficients obtained at the h-th iteration are then L(h);
S53: calculate the echo-cancelled audio signal f(h) based on the h-th iteration result:
f(h)=L(h)x′(t′)
calculate the iteration error of the h-th iteration: e(h) = x″(t′) − f(h);
if e (h) is smaller than the threshold value, L (h) is an echo cancellation layer coefficient obtained by optimizing and solving;
S54: designing a variable step factor for echo cancellation coefficient update:
μ(h) = ln(1 + |e(h)|^s)
wherein:
s is the variable step-length coefficient, set to 0.5;
S55: determine the update term based on the least-mean-square optimization algorithm;
S56: obtain the echo cancellation layer coefficients L(h+1) of the (h+1)-th iteration from the update term and the variable step factor:
L(h+1) = L(h) + μ(h)e(h)x′(t′)
let h = h + 1 and return to step S53;
and constructing an optimal conference echo cancellation model based on the optimal solution result of the audio signal at the last moment.
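The S51-S56 optimization can be sketched as an LMS adaptive filter whose step size is the variable factor μ(h) = ln(1 + |e(h)|^0.5) from S54. The update form L(h+1) = L(h) + μ(h)·e(h)·frame, the 0.1 scale that keeps the step stable, and the function name `varstep_lms` are assumptions for illustration; the patent does not spell out the S55/S56 equations in the extracted text.

```python
import math

def varstep_lms(x, desired, taps=1, iters=4000):
    """Variable-step LMS sketch of the echo-cancellation layer
    optimisation: L are the layer coefficients, f the filtered output,
    e the iteration error, mu the variable step factor."""
    L = [0.0] * taps
    n = len(x)
    for h in range(iters):
        t = taps - 1 + (h % (n - taps + 1))       # cycle through the signal
        frame = x[t - taps + 1:t + 1][::-1]       # most recent sample first
        f = sum(c * v for c, v in zip(L, frame))  # filter output f(h)
        e = desired[t] - f                        # iteration error e(h)
        mu = 0.1 * math.log(1.0 + abs(e) ** 0.5)  # variable step factor
        L = [c + mu * e * v for c, v in zip(L, frame)]
    return L
```

Because μ shrinks with the error, the step is large while the filter is far from the solution and small near convergence, which is the stated purpose of the variable step factor.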
S6: and inputting the shaped denoising conference audio signal into an optimal conference echo cancellation model to obtain the enhanced and optimized conference audio signal.
And S6, inputting the shaped denoising conference audio signal into an optimal conference echo cancellation model for echo cancellation processing, wherein the step comprises the following steps of:
and inputting the shaped denoising conference audio signal at the current moment into an optimal conference echo cancellation model for echo cancellation processing, and obtaining and playing the conference audio signal finally enhanced at the current moment.
Example 2:
as shown in fig. 2, a functional block diagram of an adaptive beam-shaping intelligent conference audio processing system according to an embodiment of the present invention may implement the adaptive beam-shaping intelligent conference audio processing method in embodiment 1.
The adaptive beam-shaping intelligent conference audio processing system 100 of the present invention may be installed in an electronic device. Depending on the functions implemented, the adaptive beam-shaping intelligent conference audio processing system may include an audio signal acquisition device 101, a beam-shaping device 102, and an echo cancellation module 103. The module of the invention, which may also be referred to as a unit, refers to a series of computer program segments, which are stored in the memory of the electronic device, capable of being executed by the processor of the electronic device and of performing a fixed function.
An audio signal acquisition device 101 for deploying a plurality of microphones in a conference room and collecting conference audio signals during a conference;
the beam shaping device 102 is configured to construct an adaptive beam shaping model based on the acquired audio signals, perform optimization solving on the constructed adaptive beam shaping model to obtain the optimal model parameters, and input the acquired noise-containing audio signal into the optimized adaptive beam shaping model;
the echo cancellation module 103 is configured to optimize the constructed multi-conference echo cancellation model, obtain optimal multi-conference echo cancellation model parameters, and input the shaped noise-removed audio signal into the optimized multi-conference echo cancellation model to obtain an enhanced and optimized conference audio signal.
In detail, the modules in the adaptive beam-shaping intelligent conference audio processing system 100 in the embodiment of the present invention use the same technical means as the adaptive beam-shaping intelligent conference audio processing method described in fig. 1, and can generate the same technical effects, which are not described herein.
Example 3:
fig. 3 is a schematic structural diagram of an electronic device for implementing an intelligent conference audio processing method for adaptive beam shaping according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication interface 13 and a bus, and may further comprise a computer program, such as program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may in other embodiments also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as codes of the program 12, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects respective parts of the entire electronic device using various interfaces and lines, executes or executes programs or modules (a program 12 for realizing intelligent conference audio processing, etc.) stored in the memory 11, and invokes data stored in the memory 11 to perform various functions of the electronic device 1 and process data.
The communication interface 13 may comprise a wired interface and/or a wireless interface (e.g. WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 1 and other electronic devices and to enable connection communication between internal components of the electronic device.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 11 and at least one processor 10 etc.
Fig. 3 shows only an electronic device with components, it being understood by a person skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or may combine certain components, or may be arranged in different components.
For example, although not shown, the electronic device 1 may further include a power source (such as a battery) for supplying power to each component, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 1 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
The electronic device 1 may optionally further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device 1 and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and do not limit the scope of the patent application to this configuration.
Disposing a plurality of microphones in a conference room and collecting conference audio signals during the conference, and constructing an adaptive beam shaping model based on the collected audio signals;
carrying out optimization solution on the constructed self-adaptive beam shaping model to obtain an optimal self-adaptive beam shaping model;
inputting the acquired audio signals into an optimal self-adaptive beam shaping model to obtain shaped denoising conference audio signals;
constructing a multi-conference echo cancellation model, and optimizing the constructed multi-conference echo cancellation model to obtain an optimal multi-conference echo cancellation model;
and inputting the shaped denoising conference audio signal into an optimal conference echo cancellation model to obtain the enhanced and optimized conference audio signal.
Specifically, the specific implementation method of the above instructions by the processor 10 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 to 3, which are not repeated herein.
It should be noted that, the foregoing reference numerals of the embodiments of the present invention are merely for describing the embodiments, and do not represent the advantages and disadvantages of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above, comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.
Claims (9)
1. An intelligent conference audio processing method of adaptive beam shaping, the method comprising:
S1: disposing a plurality of microphones in a conference room and collecting conference audio signals during the conference, and constructing an adaptive beam shaping model based on the collected audio signals;
S2: carrying out optimization solution on the constructed self-adaptive beam shaping model to obtain an optimal self-adaptive beam shaping model;
S3: inputting the acquired audio signals into the optimal self-adaptive beam shaping model to obtain the shaped denoised conference audio signals;
S4: constructing a multi-conference echo cancellation model, wherein the multi-conference echo cancellation model takes the shaped denoised audio signal as input and the echo-cancelled audio signal as output;
S5: optimizing the constructed multi-conference echo cancellation model to obtain an optimal multi-conference echo cancellation model;
S6: inputting the shaped denoised conference audio signal into the optimal conference echo cancellation model to obtain the enhanced and optimized conference audio signal.
2. The intelligent conference audio processing method according to claim 1, wherein said step S1 of disposing a plurality of microphones in a conference room and collecting conference audio signals during the conference includes:
disposing m microphones in a conference room to form a microphone array, wherein the microphone array has a circular structure of radius R, the included angle between adjacent microphones is β, and the distance between adjacent microphones is d; the included angles of adjacent microphones all take the circle center of the circular structure as the vertex, and the distance between adjacent microphones is the straight-line distance between them;
The method comprises the steps that a sound source in a conference room sends out audio signals, different microphones in a microphone array all receive the conference audio signals, the conference audio signals comprise audio signals sent out by the sound source, environment noise signals and echo signals, and the position of the sound source is an external area of the microphone array;
numbering the m microphones in the microphone array in clockwise order, taking the position of the microphone in the due-east direction as the starting point, the conference audio signal received by the i-th microphone in the microphone array is x_i(t) = s_i(t) + n_i(t), where s_i(t) is the sound-source audio received by the i-th microphone in the microphone array, including the audio signal emitted by the sound source and the echo signal, n_i(t) is the ambient noise signal received by the i-th microphone in the microphone array, and t represents time-domain information; the conference audio signal received by the adjacent (i-1)-th microphone is then:
x_{i-1}(t) = s_i(t − τ_{i,i-1}) + n_{i-1}(t)
wherein:
τ_{i,i-1}(θ) represents the time delay with which the (i-1)-th microphone receives the conference audio signal compared with the i-th microphone;
c represents sound velocity, θ represents sound source direction, and is a parameter to be solved;
constructing a coordinate system centred on the circle centre of the microphone array, with the east-west direction as the longitudinal axis and the north-south direction as the transverse axis, so as to obtain the position angles of the different microphones in the microphone array; the sound source direction is the included angle between the line connecting the sound source position to the circle centre and the positive half of the longitudinal axis;
wherein the included angle between the i-th microphone and the positive half of the longitudinal axis is (i-1)β, which represents the direction of the i-th microphone.
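The geometry in the claim above (circular array of radius R, adjacent included angle β with the circle centre as vertex, straight-line distance d) fixes β = 2π/m and, by the chord formula, d = 2R·sin(β/2). A minimal illustrative sketch of this layout, assuming clockwise numbering from the due-east microphone as in the claim (not the patent's reference implementation):

```python
import numpy as np

def mic_array_geometry(m, R):
    """Positions and angles of m microphones on a circle of radius R.

    Microphone 1 sits at the due-east point; numbering proceeds
    clockwise, matching the claim.
    """
    beta = 2 * np.pi / m             # included angle between adjacent mics
    d = 2 * R * np.sin(beta / 2)     # straight-line (chord) distance between adjacent mics
    # angle of the i-th mic measured from the starting half-axis, (i-1)*beta
    angles = np.array([(i - 1) * beta for i in range(1, m + 1)])
    # clockwise placement: x east, y north
    xy = np.column_stack([R * np.cos(angles), -R * np.sin(angles)])
    return beta, d, angles, xy

beta, d, angles, xy = mic_array_geometry(8, 0.05)
```

For m = 8 and R = 0.05 m this gives β = π/4 and d ≈ 0.038 m.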
3. The intelligent conference audio processing method according to claim 1, wherein the step S1 of constructing an adaptive beam shaping model based on the collected audio signal comprises:
and constructing an adaptive beam shaping model based on the acquired audio signals, wherein the adaptive beam shaping model comprises a sound source direction estimation layer and an adaptive beam shaping layer, the sound source direction estimation layer is used for estimating the sound source direction according to the acquired audio signals, the adaptive beam shaping layer is used for screening the audio signals received by the microphones in the corresponding directions according to the sound source direction, and filtering, noise reduction and audio signal enhancement processing are carried out on the screened audio signals by utilizing an adaptive shaping filter.
4. The intelligent conference audio processing method of adaptive beam shaping according to claim 3, wherein in step S2, the optimizing solution of the constructed adaptive beam shaping model includes:
the adaptive beam shaping model optimizing and solving part is an adaptive beam shaping layer, wherein the adaptive beam shaping layer optimally updates the weights of conference audio signals acquired by microphones in different directions according to the solved sound source direction, inputs the acquired conference audio signals into the updated adaptive beam shaping layer, and obtains conference audio signal results mainly comprising sound source audio by performing delay compensation and weighting combination processing on the conference audio signals acquired by the microphones in different directions;
The optimization solving flow of the self-adaptive beam shaping layer is as follows:
s21: constructing an optimization objective function of a weight matrix in the adaptive beam shaping layer:
wherein:
M represents a row matrix formed by the delay-compensated conference audio signals collected by the m microphones;
w represents a weight matrix comprising m weight values, wherein W is a row matrix;
G(W) represents the superposed signal value, after time delay compensation, of the conference audio signals acquired by the microphones whose directions lie within the solved sound source direction range;
S22: for the objective function G(W), randomly generate an initial weight matrix W_0; let the current iteration number of the optimization solution be k, with initial value k = 1;
Wherein:
i is an identity matrix;
s_k = c_{k+1} − c_k;
b_k = g_{k+1} − g_k;
c_k is the stationary point of the objective function after the k-th iteration, with c_0 = W_0;
g_k is the derivative of the objective function at the k-th iteration, with the weight matrix as the variable;
D_0 is the identity matrix;
S24: solving the stationary point c_{k+1} at the (k+1)-th iteration, with c_0 = W_0;
S25: calculating the rank and eigenvalues of c_{k+1} and c_k; if the ranks and eigenvalues of c_{k+1} and c_k are the same, c_{k+1} is the optimal weight matrix obtained by the optimization solution; otherwise, let k = k+1 and return to step S23.
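The D_k update of step S23 and the stationary-point formula of S24 appear only as images in the source. The surrounding definitions (s_k = c_{k+1} − c_k, b_k = g_{k+1} − g_k, D_0 = I) match the standard BFGS quasi-Newton scheme, so a hedged sketch under that assumption (not the patent's exact formula) is:

```python
import numpy as np

def bfgs_step(D, s, b):
    """Standard BFGS inverse-Hessian update built from s_k = c_{k+1}-c_k
    and b_k = g_{k+1}-g_k (assumption: the patent's S23 formula is an
    image; this is the conventional update consistent with D_0 = I)."""
    sb = s @ b
    I = np.eye(len(s))
    return (I - np.outer(s, b) / sb) @ D @ (I - np.outer(b, s) / sb) \
           + np.outer(s, s) / sb

def minimise(grad, c0, iters=50):
    """Quasi-Newton iteration c_{k+1} = c_k - D_k g_k (S24 sketch)."""
    c, D = c0.astype(float), np.eye(len(c0))
    g = grad(c)
    for _ in range(iters):
        c_new = c - D @ g
        g_new = grad(c_new)
        s, b = c_new - c, g_new - g
        # stop when the iterate no longer moves (convergence check stand-in)
        if np.linalg.norm(s) < 1e-10 or abs(s @ b) < 1e-12:
            break
        D = bfgs_step(D, s, b)
        c, g = c_new, g_new
    return c

# quadratic sanity check: minimise ||W - w_star||^2, gradient 2(W - w_star)
w_star = np.array([1.0, -2.0, 0.5])
w = minimise(lambda W: 2 * (W - w_star), np.zeros(3))
```

On a quadratic objective the update recovers the true inverse Hessian in a few steps, which is why the sanity check converges to w_star.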
5. The intelligent conference audio processing method according to claim 4, wherein the step S3 of inputting the collected audio signal into an optimal adaptive beam shaping model for shaping processing includes:
The shaping flow of the collected conference audio signals based on the adaptive beam shaping model is as follows:
S31: performing Fourier transform processing on the collected conference audio signals to obtain the Fourier spectra of the conference audio signals collected by different microphones, wherein the Fourier spectra of the conference audio signals collected by the i-th microphone and the (i-1)-th microphone are expressed as follows:
wherein:
ω represents frequency, the horizontal axis in the fourier spectrum represents frequency, and the vertical axis represents the amplitude of the frequency signal;
j represents the imaginary unit, j² = −1;
X_i(ω) represents the Fourier spectrum of the conference audio signal acquired by the i-th microphone;
S32: the noise signals in the conference audio signals collected by different microphones are uncorrelated; construct the phase difference information of the Fourier spectrum representations of the conference audio signals collected by the different microphones:
wherein:
re (·) represents the real part taking operation;
s33: determining a sound source direction based on phase difference information of different microphones:
wherein:
to solve the direction of the sound source in the conference room, the sound source direction θ that maximizes the constructed phase difference criterion is taken as the solving result;
S34: and (3) based on the determined sound source direction, optimizing and solving an adaptive beam shaping layer of the adaptive beam shaping model, inputting the acquired conference audio signals into the adaptive beam shaping layer, and performing time delay compensation and weighting and combining processing based on a weight matrix on the conference audio signals acquired by the microphones in different directions by the adaptive beam shaping layer to obtain a shaping result of the acquired audio signals.
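The S31–S33 flow above (take the θ that maximizes the real part of the phase-aligned cross-spectrum) can be sketched for a single adjacent-microphone pair. The patent's own delay expression τ_{i,i-1}(θ) is an image in the source; the standard far-field pair delay d·sin(θ)/c is assumed here purely for illustration:

```python
import numpy as np

fs, c, dmic = 16000, 343.0, 0.05   # sample rate, sound speed, assumed pair spacing
n = 1024
rng = np.random.default_rng(0)
src = rng.standard_normal(n)       # wideband source signal

true_theta = np.deg2rad(30)        # simulated DOA
tau = dmic * np.sin(true_theta) / c   # far-field pair delay (assumption: the
                                      # patent's tau formula is an image)
w = 2 * np.pi * np.fft.rfftfreq(n, 1 / fs)
X1 = np.fft.rfft(src)
X2 = X1 * np.exp(-1j * w * tau)    # second mic hears the source delayed by tau

# S32/S33 sketch: grid-search the theta maximising the real part of the
# phase-aligned cross-spectrum, mirroring the claim's argmax criterion.
thetas = np.deg2rad(np.arange(-90, 91))
score = [np.sum(np.real(X1 * np.conj(X2)
                        * np.exp(-1j * w * dmic * np.sin(th) / c)))
         for th in thetas]
theta_hat = thetas[int(np.argmax(score))]
```

Because X1·conj(X2) = |X1|²·e^{jωτ}, the score Σ|X1|²·cos(ω(τ − τ(θ))) peaks exactly when τ(θ) matches the true delay.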
6. The intelligent conference audio processing method according to claim 1, wherein the constructing a multi-conference echo cancellation model in step S4 includes:
the method comprises the steps of constructing a multi-conference echo cancellation model, wherein the multi-conference echo cancellation model comprises an expected audio signal calculation layer and an echo cancellation layer, the expected audio signal calculation layer is used for receiving a shaped conference audio signal of a previous period, removing noise and echo signals of the shaped conference audio signal based on signal conversion processing, taking the audio signal from which the noise and echo signals are removed as the expected audio signal of a current period, the echo cancellation layer is structured as a signal filter, the echo cancellation layer optimizes the signal filter in the echo cancellation layer based on the expected audio signal of the previous period and the shaped conference audio signal of the previous period, and carries out quick echo cancellation processing on the shaped conference audio signal of the current period based on the optimized echo cancellation layer, so that an enhanced conference audio signal is quickly obtained;
The expected audio signal calculation flow of the last period in the expected audio signal calculation layer is as follows:
S41: inputting the shaped conference audio signal x′(t′) of the last period into the expected audio signal calculation layer, where t′ represents the time-domain information of the previous period;
s42: calculating the transformation result of the shaped conference audio signals under different decomposition scales:
wherein:
delta (t) represents a signal decomposition function;
a represents a decomposition scale;
d (a) represents a decomposition coefficient at the decomposition scale a;
s43: deleting the decomposition coefficients with the decomposition coefficients smaller than the preset value threshold, and reconstructing the rest of the decomposition coefficients into audio signals:
wherein:
D′(a) represents the set of retained decomposition coefficients;
x″(t′) represents the audio signal reconstructed from the retained decomposition coefficients;
the reconstructed audio signal x "(t') is taken as the desired audio signal for the last period.
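The decomposition function δ(t) and the transform formula of S42 are images in the source; assuming a one-level Haar wavelet decomposition as a stand-in, the S42/S43 decompose–threshold–reconstruct flow can be sketched as:

```python
import numpy as np

def haar_decompose(x):
    """One-level Haar split into approximation / detail coefficients
    (even-length input assumed)."""
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    return a, d

def haar_reconstruct(a, d):
    """Inverse of haar_decompose: interleave the recombined samples."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2)
    x[1::2] = (a - d) / np.sqrt(2)
    return x

def denoise(x, thresh):
    """S42/S43 sketch: decompose, delete coefficients below the preset
    threshold, reconstruct x''(t') from the retained coefficients."""
    a, d = haar_decompose(x)
    d = np.where(np.abs(d) >= thresh, d, 0.0)  # delete small detail coefficients
    return haar_reconstruct(a, d)
```

With a zero threshold every coefficient is retained and the reconstruction is exact; raising the threshold suppresses the small (noise-dominated) detail coefficients.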
7. The intelligent conference audio processing method according to claim 6, wherein the optimizing the constructed multi-conference echo cancellation model in step S5 includes:
the multi-conference echo cancellation model receives the shaped conference audio signal and the expected audio signal of the previous period, optimizes a signal filter in an echo cancellation layer to obtain an echo cancellation layer which can be used for echo cancellation of the shaped conference audio signal of the current period, and the optimization flow of the echo cancellation layer is as follows:
S51: acquiring the shaped conference audio signal x′(t′) of the last period, the expected audio signal x″(t′), and the echo cancellation layer coefficient, wherein the echo cancellation layer coefficient is a weight vector with the same length as the audio signal, and the finite convolution of the echo cancellation layer coefficient with the shaped conference audio signal is the echo-cancelled audio signal;
s52: let the echo cancellation layer coefficient of the previous period be L (0), then the echo cancellation layer coefficient obtained by the h iteration is L (h);
s53: calculating an echo cancellation audio signal f (h) based on the h iteration result:
f(h)=L(h)x′(t′)
calculating the iteration error of the h-th iteration: e(h) = x″(t′) − f(h);
if e (h) is smaller than the threshold value, L (h) is an echo cancellation layer coefficient obtained by optimizing and solving;
s54: designing a variable step factor for echo cancellation coefficient update:
μ(h) = ln(1 + |e(h)|^s)
wherein:
s is the variable-step coefficient, set to 0.5;
S56: and obtaining an echo cancellation layer coefficient L (h+1) of the h+1th iteration based on the updated coefficient and the variable step factor:
let h=h+1, return to step S53;
and constructing an optimal conference echo cancellation model based on the optimal solution result of the audio signal at the last moment.
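The S55/S56 coefficient update formula is an image in the source; a sketch using the conventional LMS rule L ← L + μ(h)·e(h)·x (with NLMS-style normalisation added for stability, an assumption not in the claim) and the claim's variable step μ(h) = ln(1 + |e(h)|^s), s = 0.5:

```python
import numpy as np

def var_step_lms(x, desired, taps=16, s=0.5):
    """Variable-step LMS sketch of S51-S56.  The update rule is the
    standard LMS/NLMS recursion (assumption: the patent's S55/S56
    formula is an image); the step factor mu(h) = ln(1 + |e(h)|**s)
    follows the claim, with s = 0.5."""
    L = np.zeros(taps)               # echo cancellation layer coefficients L(h)
    out = np.zeros_like(x)
    for h in range(taps, len(x)):
        xv = x[h - taps:h][::-1]     # most recent shaped samples first
        f = L @ xv                   # filtered output f(h) = L(h) * x'(t')
        e = desired[h] - f           # iteration error e(h) = x''(t') - f(h)
        mu = np.log1p(abs(e) ** s)   # variable step factor mu(h)
        L = L + mu * e * xv / (xv @ xv + 1e-8)  # normalised update (stability)
        out[h] = f
    return L, out
```

Because μ(h) shrinks with the error, the filter takes large steps while the echo is strong and small steps near convergence.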
8. The intelligent conference audio processing method according to claim 7, wherein in step S6, the shaped denoising conference audio signal is input into an optimal conference echo cancellation model for echo cancellation processing, and the method comprises:
and inputting the shaped denoising conference audio signal at the current moment into an optimal conference echo cancellation model for echo cancellation processing, and obtaining and playing the conference audio signal finally enhanced at the current moment.
9. An adaptive beam-shaping intelligent conference audio processing system, the system comprising:
an audio signal acquisition device for deploying a plurality of microphones in a conference room and collecting conference audio signals during a conference;
the beam shaping device is used for constructing an adaptive beam shaping model based on the collected audio signals, optimizing the constructed adaptive beam shaping model to obtain an optimal adaptive beam shaping model, and inputting the collected noise-containing audio signals into the adaptive beam shaping model after optimization solution;
the echo cancellation module is used for optimizing the constructed multi-conference echo cancellation model to obtain optimal multi-conference echo cancellation model parameters, and inputting the shaped denoised audio signals into the optimized multi-conference echo cancellation model to obtain the enhanced and optimized conference audio signals, so as to realize the intelligent conference audio processing method for self-adaptive beam shaping according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310012812.7A CN116110419A (en) | 2023-01-05 | 2023-01-05 | Intelligent conference audio processing method and system for self-adaptive beam shaping |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116110419A true CN116110419A (en) | 2023-05-12 |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117172135A (en) * | 2023-11-02 | 2023-12-05 | 山东省科霖检测有限公司 | Intelligent noise monitoring management method and system |
CN117172135B (en) * | 2023-11-02 | 2024-02-06 | 山东省科霖检测有限公司 | Intelligent noise monitoring management method and system |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230512 |