CN116106826A - Sound source positioning method, related device and medium

Info

Publication number
CN116106826A
Authority
CN
China
Prior art keywords
frame
frequency domain
data
sound source
framing data
Prior art date
Legal status
Pending
Application number
CN202211626758.7A
Other languages
Chinese (zh)
Inventor
孙中祥
秦亚光
Current Assignee
Beijing Eswin Computing Technology Co Ltd
Original Assignee
Beijing Eswin Computing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Eswin Computing Technology Co Ltd filed Critical Beijing Eswin Computing Technology Co Ltd
Priority to CN202211626758.7A
Publication of CN116106826A

Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Stereophonic System (AREA)

Abstract

The present disclosure provides a sound source localization method, related apparatus, and medium. The method comprises: performing framing and windowing on the time domain data of the audio signals received by a microphone array to obtain time domain framing data; performing a short-time Fourier transform on the time domain framing data to obtain frequency domain framing data; reconstructing enhanced frequency domain framing data by performing noise suppression on the frequency domain framing data frame by frame; and, based on the enhanced frequency domain framing data, performing a spectral peak search with an adaptive beamforming algorithm to estimate the direction of arrival of the sound source frame by frame. The method reduces the interference of noise with sound source localization and improves the accuracy of sound source localization.

Description

Sound source positioning method, related device and medium
Technical Field
The disclosure relates to the technical field of voice positioning, and in particular relates to a sound source positioning method, a related device and a medium.
Background
With the development of artificial intelligence, intelligent voice devices such as smart speakers, intelligent conference pickup devices, and intelligent robots are increasingly used in daily life. An intelligent voice device captures the sound of surrounding sound sources, and locating the position of the sound source is necessary in order to obtain clear voice. An intelligent voice device equipped with a microphone array can process the audio signals received by the array through a sound source localization algorithm to determine the direction of the sound source relative to the array, namely the direction of arrival (DOA, Direction of Arrival) of the sound source. At present, the direction of arrival of a sound source can be located through sound source localization algorithms based on time delay estimation, on high-resolution spectrum estimation, on steerable beamforming, and the like. However, since the microphone array is often in a noisy or even strong-noise environment, the above sound source localization algorithms are susceptible to interference from environmental noise, which reduces the accuracy of sound source localization.
Disclosure of Invention
In order to solve the technical problems, the disclosure provides a sound source positioning method, a related device and a medium, which can reduce the interference of noise on sound source positioning and improve the accuracy of sound source positioning.
According to a first aspect of the present disclosure, there is provided a sound source localization method including:
the method comprises the steps of carrying out framing windowing on time domain data of audio signals received by a microphone array to obtain time domain framing data;
obtaining frequency domain framing data by performing short-time Fourier transform on the time domain framing data;
reconstructing enhanced frequency domain framing data of the frequency domain framing data by performing noise suppression on the frequency domain framing data frame by frame;
and based on the enhanced frequency domain framing data, performing spectral peak search by utilizing an adaptive beam forming algorithm to estimate the direction of arrival of the sound source frame by frame.
Optionally, the frequency domain framing data includes frequency domain framing data of a plurality of channels, and reconstructing the enhanced frequency domain framing data of the frequency domain framing data by noise suppressing the frequency domain framing data frame by frame includes:
calculating an estimated noise covariance of frequency domain framing data of a current frame of a first channel of the plurality of channels;
calculating a gain function of frequency domain framing data of the current frame of the first channel before and after noise suppression by using the estimated noise covariance;
reconstructing the enhanced frequency domain framing data of the current frame of the frequency domain framing data of the plurality of channels after noise suppression based on the amplitude values of the frequency domain framing data of the plurality of channels of the current frame and the gain function.
Optionally, after the time domain frame data is obtained by performing frame windowing on the time domain data of the audio signal received by the microphone array, the sound source positioning method further includes:
and carrying out voice activity detection on the time domain frame data frame by utilizing a voice activity detection technology to obtain a voice activity detection result of the time domain frame data of each frame, wherein whether voice exists in the time domain frame data of each frame is determined according to the voice activity detection result.
Optionally, the calculating the estimated noise covariance of the frequency domain framing data of the current frame of the first channel of the plurality of channels comprises:
determining a frequency domain representation of frequency domain framing data of a current frame of the first channel based on the voice activity detection result;
calculating the estimated noise covariance of the frequency domain framing data of the current frame of the first channel according to the frequency domain representation of the frequency domain framing data of the current frame of the first channel;
and if the current frame of the first channel does not have voice, taking the weighted sum of the estimated noise covariance of the frequency domain framing data of the previous frame of the first channel and the power of the frequency domain framing data of the current frame of the first channel as the estimated noise covariance of the frequency domain framing data of the current frame of the first channel.
Optionally, calculating the gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression using the estimated noise covariance includes:
calculating a posterior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel based on the estimated noise covariance and the power of the frequency domain framing data of the current frame of the first channel;
taking a weighted average of the prior signal-to-noise ratio of the frequency domain framing data of the previous frame of the first channel and the posterior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel as the prior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel based on a decision-directed estimation mode;
and calculating a gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression by taking the minimum mean square error of the spectral amplitude estimate as the distortion measure criterion, based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel.
Optionally, the adaptive beamforming algorithm includes a minimum variance undistorted response beamforming algorithm, and the performing, based on the enhanced frequency domain framing data, a spectral peak search with the adaptive beamforming algorithm to estimate a direction of arrival of the sound source frame by frame includes:
taking the enhanced frequency domain framing data of the current frame of the channels as the input of the minimum variance undistorted response beamforming algorithm, and solving the optimal weight vector of the current frame of the minimum variance undistorted response beamforming algorithm by using a Lagrange multiplier method under the condition that constraint conditions are met;
calculating an average power of an output beam of the current frame of the minimum variance distortion-free response beamforming algorithm based on the optimal weight vector of the current frame and the enhanced frequency domain framing data of the current frame of the plurality of channels;
and in the angle scanning range, carrying out spectrum peak search on the average power of the output beam of the current frame, and taking the signal incidence angle corresponding to the peak point as the estimated direction of arrival of the sound source of the current frame.
Optionally, the constraint conditions include that the signal from the desired incidence direction passes through completely while interference sources and noise in other directions are suppressed to the greatest extent.
Optionally, after performing a spectral peak search by using an adaptive beamforming algorithm to estimate the direction of arrival of the sound source frame by frame based on the enhanced frequency domain framing data, the sound source localization method further includes:
and carrying out smoothing processing on the estimated direction of arrival of the sound source of the current frame based on the voice activity detection result, wherein if voice exists in the time domain framing data of the current frame, the estimated direction of arrival of the sound source of the current frame is determined as the final direction of arrival of the sound source of the current frame, and if no voice exists in the time domain framing data of the current frame, the estimated direction of arrival of the sound source of the previous frame is determined as the final direction of arrival of the sound source of the current frame.
According to a second aspect of the present disclosure, there is provided a sound source localization apparatus comprising:
the framing and windowing processing unit is configured to obtain time domain framing data by performing framing and windowing processing on time domain data of the audio signals received by the microphone array;
the short-time Fourier transform unit is configured to obtain frequency domain framing data by carrying out short-time Fourier transform on the time domain framing data;
a frequency domain framing data enhancement unit configured to reconstruct enhanced frequency domain framing data of the frequency domain framing data by performing noise suppression on the frequency domain framing data frame by frame;
and a direction of arrival estimation unit configured to perform a spectral peak search using an adaptive beamforming algorithm to estimate a direction of arrival of a sound source frame by frame based on the enhanced frequency domain framing data.
Optionally, the sound source positioning device further comprises:
and the voice activity detection unit is configured to detect voice activity of the time domain frame data frame by utilizing a voice activity detection technology to obtain a voice activity detection result of the time domain frame data of each frame, wherein whether voice exists in the time domain frame data of each frame is determined according to the voice activity detection result.
According to a third aspect of the present disclosure, there is provided a sound source localization system comprising:
a microphone array configured to receive time domain data of an audio signal from a sound source;
sound source localization means configured to perform the method of any of the above.
According to a fourth aspect of the present disclosure, there is provided an electronic device comprising: a processor, a memory, and a program stored on the memory and executable on the processor, which, when executed by the processor, implements the steps of the method described above.
According to a fifth aspect of the present disclosure, there is provided a storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the steps of the method described above.
According to the embodiments of the present disclosure, time domain framing data is obtained by performing framing and windowing on the time domain data of the audio signals received by the microphone array; frequency domain framing data is obtained by performing a short-time Fourier transform on the time domain framing data; enhanced frequency domain framing data is reconstructed by performing noise suppression on the frequency domain framing data frame by frame; and the direction of arrival of the sound source is estimated frame by frame with an adaptive beamforming algorithm based on the enhanced frequency domain framing data. In this way, the frequency domain framing data of the multi-channel audio signals received by the microphone array is preprocessed with a noise suppression technique before sound source localization is performed, and the enhanced frequency domain framing data reconstructed after noise suppression raises the signal-to-noise ratio of the received audio signals, thereby improving the accuracy of sound source localization.
The enhanced frequency domain framing data of the current frame of the plurality of channels serves as the input of the adaptive beamforming algorithm, which performs a spectral peak search to estimate the direction of arrival of the sound source frame by frame. The direction of arrival can therefore be estimated with only one short-time Fourier transform of the time domain framing data of the multi-channel audio signals received by the microphone array, which reduces the amount of computation and improves the real-time performance of sound source localization.
Noise suppression is performed on the frequency domain framing data of only one channel: the estimated noise covariance of the frequency domain framing data of the current frame of a first channel of the plurality of channels is calculated, the gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression is calculated using the estimated noise covariance, and the enhanced frequency domain framing data of the current frame of all channels after noise suppression is reconstructed from the amplitude values of the frequency domain framing data of the plurality of channels of the current frame and this gain function. Using the single-channel gain function for all channels further reduces the amount of computation and improves the real-time performance of sound source localization.
After the voice activity detection result of the time domain framing data of each frame is obtained, the frequency domain representation of the frequency domain framing data of the current frame of the first channel is determined based on that result, and the estimated noise covariance of the frequency domain framing data of the current frame of the first channel is calculated from this frequency domain representation. Using the voice activity detection result in the noise estimation process effectively reduces the estimation error of the estimated noise covariance and further improves the accuracy of sound source localization. In addition, smoothing the estimated direction of arrival of the sound source of the current frame based on the voice activity detection result effectively avoids abnormal jumps of the estimate caused by noise, which improves the accuracy of the estimated direction of arrival and hence of sound source localization.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
Fig. 1 illustrates a schematic configuration of a sound source localization system provided according to an embodiment of the present disclosure;
fig. 2 shows a flow diagram of a sound source localization method provided according to an embodiment of the present disclosure;
FIG. 3 shows a flow diagram for reconstructing enhanced frequency domain framing data by performing noise suppression according to an embodiment of the present disclosure;
FIG. 4 shows a schematic flow chart for estimating the direction of arrival of a sound source on a frame-by-frame basis using a minimum variance undistorted response beamforming algorithm, provided in accordance with an embodiment of the disclosure;
fig. 5 illustrates a schematic structural view of a sound source localization apparatus provided according to an embodiment of the present disclosure;
fig. 6 illustrates a schematic structural diagram of an electronic device provided according to an embodiment of the present disclosure.
Detailed Description
In order that the disclosure may be understood, a more complete description of the disclosure will be rendered by reference to the appended drawings. Preferred embodiments of the present disclosure are shown in the drawings. This disclosure may, however, be embodied in different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
The following concepts are used herein:
microphone array: is formed by arranging a certain number of sound collecting devices according to a preset topological structure so as to sample and process the audio signals from different directions in space. Each sound collecting device in the microphone array may be referred to as an array element, each microphone array comprising at least two array elements. As one example, the sound collection device may be an acoustic sensor or microphone. The microphone array may be classified into a linear microphone array, a planar microphone array, a stereo microphone array, and the like according to the topology of the microphone array.
Sound source localization: under the condition that the sound source direction is unknown, array elements are arranged in a set topological structure to form a microphone array, and the audio signals collected by the array are processed by a corresponding algorithm to obtain the direction of the sound source relative to the microphone array, namely the direction of arrival (DOA, Direction of Arrival) of the sound source.
Voice activity detection (Voice Activity Detection, VAD): also known as speech endpoint detection or speech boundary detection, refers to detecting the presence or absence of speech in a noisy environment.
Noise suppression: filtering the audio signal, i.e., filtering noise out of the audio signal so as to extract the audio signal in the direction of the target sound source while suppressing or eliminating audio from other directions, including interfering speakers and background noise, in order to obtain or recover a clearer audio signal.
Minimum variance distortionless response (MVDR) beamforming algorithm: an adaptive beamforming algorithm based on the maximum signal-to-noise ratio (Signal Noise Ratio, SNR) criterion; it adaptively minimizes the power of the microphone array output beam while keeping the signal from the desired direction undistorted, thereby maximizing the output signal-to-noise ratio.
Beamforming: the output of each array element of the microphone array is subjected to delay or phase compensation and amplitude weighting to form a beam in a specific direction, so that signals within the beam are picked up and noise outside the beam is suppressed, achieving voice enhancement.
Fig. 1 illustrates a schematic configuration diagram of a sound source localization system provided according to an embodiment of the present disclosure. As shown in fig. 1, the sound source localization system 100 includes an intelligent voice device 110 and a server 120. In some embodiments, the intelligent voice device 110 and the server 120 may be connected through a network, for example via GPRS, 4G, WiFi, or the like.
The intelligent voice device 110 is a device supporting a voice interaction function (such as a ticket machine, a pickup device, a recording pen, a smart speaker, or a mobile communication device) and is provided with a microphone array 111 composed of a plurality of microphone array elements 112. The microphone array 111 can receive audio signals from sound sources in different directions. The server 120 provides business services for the intelligent voice device 110 and has a voice recognition function; a sound source localization apparatus 121 is deployed on it, through which the direction of the sound source relative to the microphone array 111, namely the direction of arrival (DOA, Direction of Arrival) of the sound source, can be obtained, so that the audio signal received by the microphone array 111 can be captured clearly. The server 120 may obtain instruction information by performing voice recognition on the clearly captured audio signal, thereby providing related business services to the user of the intelligent voice device 110. The specific steps of estimating the direction of arrival of the sound source from the audio signals received by the microphone array 111 are described in detail below and are omitted here.
It should be noted that the microphone array 111 in fig. 1 is only an example, and the plurality of microphone array elements 112 may be arranged in other topologies, and the microphone array 111 may be a linear microphone array, a planar microphone array, a stereo microphone array, or the like.
The server 120 may be a server deployed in the cloud, may be a server deployed locally and dedicated to the intelligent voice device 110 to perform voice recognition processing, or may be a server deployed in the intelligent voice device 110 to perform voice recognition processing. For example, in the case where the intelligent voice device 110 is a ticket purchasing machine, the intelligent voice device 110 may be deployed in a subway, a railway station, or the like, and the server 120 may be deployed in a cloud to provide business services to the intelligent voice device 110 in the form of a data center. In the case where the intelligent speech device 110 is a sound pickup device, a sound recording pen, or an intelligent sound box, the intelligent speech device 110 may be deployed at a conference site, and the server 120 may be deployed locally as a server dedicated to the speech recognition processing performed by the intelligent speech device 110. In the case where the intelligent speech device 110 is a mobile communication device, the server 120 may be deployed as a dedicated processor on the intelligent speech device 110.
Fig. 2 shows a flowchart of a sound source localization method provided according to an embodiment of the present disclosure. Referring to fig. 2, the sound source localization method provided by the embodiment of the present disclosure includes steps S210 to S240.
In step S210, time-domain frame data is obtained by performing frame-division windowing on time-domain data of an audio signal received by the microphone array.
In some embodiments, the microphone array is formed by arranging a plurality of microphone array elements in a predetermined topological structure to sample and process audio signals from sound sources in different directions in space. According to its topology, a microphone array may be classified as a linear, planar, or stereo microphone array, among others. An audio signal can be understood as a carrier of acoustic frequency and amplitude-variation information. Because of the presence of noise, the audio signal received by the microphone array is a noisy audio signal comprising clean speech and noise. The audio signals received by the microphone array comprise the audio signals of a plurality of channels travelling from the sound source to the plurality of microphone array elements, and each microphone array element samples the audio signal it receives at a preset sampling frequency to convert it into a processable digital signal, thereby obtaining the time domain data of the plurality of channels:
y(t) = [y_0(t), y_1(t), …, y_{m-1}(t)]^T    (1)

where y(t) is the time domain data of the audio signals of the plurality of channels received by the microphone array, y_i(t) is the time domain data of the audio signal of the ith channel received by the ith microphone array element of the microphone array, 0 ≤ i < m, i and m are integers, and m is the number of microphone array elements.
It should be noted that an audio signal is a time-varying non-stationary signal, but it is short-time stationary: it is generally considered approximately stationary over a span of 10-30 ms, so by performing framing and windowing on the continuous time domain data of the plurality of channels, each frame of the signal can be treated as approximately stationary. To increase the continuity between adjacent frames, the framing may use overlapping segmentation, where the overlap between frames is the frame shift. In some embodiments, the frame shift may be half the frame length. The window function may be a rectangular, Hanning, Hamming, or triangular window, among others; for example, the Hamming window may be selected as the window type, since it has a smooth low-pass characteristic and reflects the frequency characteristics of short-time signals such as speech well.
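As an illustrative sketch only (not part of the claimed method), the framing and windowing described above might look as follows in Python with NumPy; the frame length of 512 samples, the half-frame shift, and all function and variable names are assumptions:

```python
import numpy as np

def frame_and_window(y, frame_len=512, hop=256):
    """Split one channel's time domain data into overlapping frames and
    apply a Hamming window; hop = frame_len // 2 gives 50% overlap."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.empty((n_frames, frame_len))
    for n in range(n_frames):
        frames[n] = y[n * hop : n * hop + frame_len] * window
    return frames  # shape (n_frames, frame_len)
```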
In some embodiments, after continuous time domain data of a plurality of channels received by the microphone array is acquired, a window with a finite length is used to perform framing and windowing processing on the time domain data of each channel, so that time domain framing data can be obtained:
y(k,n) = [y_0(k,n), y_1(k,n), …, y_{m-1}(k,n)]^T    (2)

where y(k,n) is the time domain framing data of the audio signals of the plurality of channels received by the microphone array, y_i(k,n) is the time domain framing data of the audio signal of the ith channel received by the ith microphone array element, and y(k,n) can be written as s(k,n) + d(k,n), where s(k,n) is the time domain framing data of the clean speech signal received by the microphone array and d(k,n) is the time domain framing data of the noise. Here 0 ≤ i < m; i, m, n, and k are integers; m is the number of microphone array elements; n is the frame index; and k is the frequency bin index.
In some embodiments, in a real speech environment there is typically considerable noise in the audio signals of the plurality of channels received by the microphone array. After the time domain framing data of these audio signals is obtained, voice activity detection is performed on the time domain framing data frame by frame using a voice activity detection technique, giving a voice activity detection result for each frame; whether voice exists in the time domain framing data of each frame is determined according to this result. In some embodiments, the time domain framing data may be fed into a neural-network-based voice activity detection model that classifies each frame as containing voice or not, thereby separating voice frames from noise frames:
VAD(y(k,n)) = 1, voice present in the time domain framing data of the nth frame
VAD(y(k,n)) = 0, no voice in the time domain framing data of the nth frame    (3)

where VAD(·) denotes the voice activity detection process and n is the frame index (an integer). A result of 1 means that voice exists in the time domain framing data of the nth frame, which is a voice frame; a result of 0 means that no voice exists in the time domain framing data of the nth frame, which is a noise frame.
Since neural-network-based voice activity detection models and their training are mature prior art, they are not described further here.
It should be noted that, in addition to the voice activity detection model based on the neural network, other voice activity detection techniques may be used in the embodiments of the present disclosure to identify whether voice exists in the time-domain frame data of each frame, and any voice activity detection technique capable of separating a voice frame and a noise frame is within the protection scope of the present disclosure.
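The disclosure does not fix a particular detector; as a minimal stand-in for the neural-network model, the 0/1 decision of equation (3) can be illustrated with a simple per-frame energy threshold (the threshold value and all names below are assumptions):

```python
import numpy as np

def vad_energy(frames, threshold=1e-4):
    """Crude VAD sketch: flag a frame as speech (1) when its mean
    energy exceeds a fixed threshold, otherwise as noise (0)."""
    energy = np.mean(frames ** 2, axis=1)
    return (energy > threshold).astype(int)
```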
In step S220, the time-domain frame data is subjected to short-time fourier transform to obtain frequency-domain frame data.
The short-time Fourier transform (STFT) is the most widely used method for studying non-stationary signals. Its basic idea is to divide the signal into many short time intervals and analyze each interval with a Fourier transform to determine the frequencies present in that interval. In some embodiments, the frequency domain framing data of the audio signals of the plurality of channels received by the microphone array is obtained by applying the short-time Fourier transform to the time domain framing data:
Y(k,n) = STFT(y(k,n))    (4)
where Y(k,n) is the frequency domain framing data of the audio signals of the plurality of channels received by the microphone array, y(k,n) is the corresponding time domain framing data, and Y(k,n) can be written as S(k,n) + D(k,n), where S(k,n) is the frequency domain framing data of the clean speech signals received by the microphone array and D(k,n) is the frequency domain framing data of the noise. STFT(·) denotes the short-time Fourier transform; n and k are integers, n is the frame index, and k is the frequency bin index.
In some embodiments, since y(k,n) is the time domain framing data of the audio signals of the plurality of channels and y_i(k,n) is that of the ith channel received by the ith microphone array element, performing a short-time Fourier transform on y(k,n) amounts to performing a short-time Fourier transform on each y_i(k,n) separately. The frequency domain framing data Y(k,n) of the audio signals of the plurality of channels received by the microphone array can thus be expressed as:
Y(k,n) = [Y_0(k,n), Y_1(k,n), …, Y_{m-1}(k,n)]^T    (5)

where Y(k,n) is the frequency domain framing data of the audio signals of the plurality of channels received by the microphone array, Y_i(k,n) is the frequency domain framing data of the audio signal of the ith channel received by the ith microphone array element, 0 ≤ i < m, i is an integer, and m is the number of microphone array elements.
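A sketch of step S220 using SciPy's STFT, applied along the last axis so that each of the m channels is transformed independently as in equation (5); the sampling rate and window parameters are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft

def multichannel_stft(y, fs=16000, frame_len=512):
    """y: time domain data of shape (m, n_samples), one row per channel.
    Returns Y with Y[i, k, n] = Y_i(k, n), shape (m, n_freqs, n_frames)."""
    _, _, Y = stft(y, fs=fs, window='hamming', nperseg=frame_len,
                   noverlap=frame_len // 2, axis=-1)
    return Y
```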
In step S230, enhanced frequency domain framing data of the frequency domain framing data is reconstructed by noise suppressing the frequency domain framing data frame by frame.
Sound source localization arranges array elements in a set topological structure to form a microphone array when the sound source direction is unknown, and processes the audio signals collected by the array through a corresponding beamforming algorithm to obtain the direction of the sound source relative to the microphone array, namely its direction of arrival (DOA, Direction of Arrival). However, since a large amount of noise is usually present in the multi-channel audio signals received by the microphone array, that noise interferes with the beamforming processing and reduces the accuracy of sound source localization. Therefore, before sound source localization is performed, the frequency domain framing data of the received multi-channel audio signals is preprocessed with a noise suppression technique, and the enhanced frequency domain framing data is reconstructed after noise suppression to raise the signal-to-noise ratio of the audio signals received by the microphone array, thereby improving the accuracy of sound source localization.
Fig. 3 shows a flow diagram for reconstructing enhanced frequency domain framing data by performing noise suppression according to an embodiment of the present disclosure. Referring to fig. 3, a method of reconstructing enhanced frequency domain framing data by performing noise suppression provided by an embodiment of the present disclosure includes steps S310 to S330.
In step S310, an estimated noise covariance of frequency domain framing data of a current frame of a first channel of the plurality of channels is calculated.
Since the accuracy of noise estimation determines the quality of noise suppression, and since noise in a real environment generally affects the speech spectrum non-uniformly, the present disclosure adopts a time-recursive smoothing method based on the speech presence probability for noise estimation. In some embodiments, after voice activity detection has been performed on the time domain framing data frame by frame and a detection result obtained for each frame, the frequency domain representation of the frequency domain framing data of the current frame of the first channel of the plurality of channels received by the microphone array is determined based on that result. Based on the voice activity detection result, the frequency domain representation of the frequency domain framing data of the nth frame of the ith channel is:
Y_i(k,n) = D_i(k,n),             under H_0(k,n)
Y_i(k,n) = S_i(k,n) + D_i(k,n),  under H_1(k,n)    (6)

where H_0(k,n) denotes the state in which no voice is present in the nth frame of time domain framing data of the ith channel, H_1(k,n) denotes the state in which voice is present, Y_i(k,n) is the nth frame of frequency domain framing data of the audio signal of the ith channel received by the microphone array, S_i(k,n) is the nth frame of frequency domain framing data of the clean speech signal of the ith channel, and D_i(k,n) is the nth frame of frequency domain framing data of the noise of the ith channel. Here 0 ≤ i < m; i, n, k, and m are integers; m is the number of microphone array elements; n is the frame index; and k is the frequency bin index.
Then, an estimated noise covariance of the frequency domain framing data of the current frame of the first channel is calculated from the frequency domain representation of the frequency domain framing data of the current frame of the first channel received by the microphone array. In some embodiments, if the current frame of the first channel has speech, the estimated noise covariance of the frequency domain framing data of the previous frame of the first channel is taken as the estimated noise covariance of the frequency domain framing data of the current frame of the first channel, and if the current frame of the first channel has no speech, the weighted sum of the estimated noise covariance of the frequency domain framing data of the previous frame of the first channel and the power of the frequency domain framing data of the current frame of the first channel is taken as the estimated noise covariance of the frequency domain framing data of the current frame of the first channel.
In some embodiments, the noise estimation is performed as follows:
λ̂_d(k,n) = α_d · λ̂_d(k,n-1) + (1 - α_d) · |Y_i(k,n)|²,  under H_0(k,n)
λ̂_d(k,n) = λ̂_d(k,n-1),                                   under H_1(k,n)    (7)

where λ̂_d(k,n) is the estimated noise covariance of the nth frame of frequency domain framing data of the audio signal of the ith channel received by the microphone array, λ̂_d(k,n-1) is the estimated noise covariance of the (n-1)th frame, |Y_i(k,n)|² is the power of the nth frame of frequency domain framing data of the ith channel, and α_d is a smoothing factor with 0 < α_d < 1.

In the H_0(k,n) state, i.e., when no voice is present in the nth frame of time domain framing data of the ith channel, the estimated noise covariance λ̂_d(k,n) is the weighted sum of the estimated noise covariance λ̂_d(k,n-1) of the (n-1)th frame and the power of the nth frame of frequency domain framing data of the ith channel. In the H_1(k,n) state, i.e., when voice is present in the nth frame of time domain framing data of the ith channel, the estimated noise covariance λ̂_d(k,n) equals the estimated noise covariance λ̂_d(k,n-1) of the (n-1)th frame.
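A sketch of the recursion in equation (7) for one channel, gated by the voice activity flag; the smoothing factor α_d = 0.9 and all names are assumptions:

```python
import numpy as np

def update_noise_cov(noise_cov_prev, Y_cur, vad_flag, alpha_d=0.9):
    """Equation (7). noise_cov_prev: previous-frame estimate (n_freqs,);
    Y_cur: current-frame frequency domain data Y_i(k, n) (n_freqs,);
    vad_flag: 1 if speech is present in the current frame, else 0."""
    if vad_flag:  # H1: speech present -> hold the previous estimate
        return noise_cov_prev
    power = np.abs(Y_cur) ** 2  # |Y_i(k, n)|^2
    return alpha_d * noise_cov_prev + (1.0 - alpha_d) * power  # H0 branch
```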
In step S320, a gain function of frequency domain framing data of the current frame of the first channel before and after noise suppression is calculated using the estimated noise covariance.
In some embodiments, after the estimated noise covariance is obtained, the a priori and a posteriori signal-to-noise ratios can be further estimated, yielding the gain function of the frequency domain framing data of the current frame of the first channel received by the microphone array before and after noise suppression. In some embodiments, based on the estimated noise covariance and the power of the nth frame of frequency domain framing data of the audio signal of the ith channel, the a posteriori signal-to-noise ratio of that frame is estimated as:
γ_k(n) = |Y_i(k,n)|² / λ̂_d(k,n)    (8)

where γ_k(n) is the a posteriori signal-to-noise ratio of the nth frame of frequency domain framing data of the audio signal of the ith channel received by the microphone array, λ̂_d(k,n) is the estimated noise covariance of that frame, and |Y_i(k,n)|² is its power.
In some embodiments, the a priori signal-to-noise ratio of the nth frame of frequency domain framing data of the audio signal of the ith channel is estimated with a decision-directed method: a weighted average of the a priori signal-to-noise ratio derived from the (n-1)th frame and the a posteriori signal-to-noise ratio of the nth frame is used as the a priori signal-to-noise ratio of the nth frame. In some embodiments, the estimate is:
ξ_k(n) = a · |X(k,n-1)|² / λ̂_d(k,n-1) + (1 - a) · max[γ_k(n) - 1, 0]    (9)

where ξ_k(n) is the a priori signal-to-noise ratio of the nth frame of frequency domain framing data of the audio signal of the ith channel received by the microphone array, λ̂_d(k,n-1) is the estimated noise covariance of the (n-1)th frame, |X(k,n-1)|² is the power of the (n-1)th frame of enhanced frequency domain framing data of the audio signals received by the microphone array after noise suppression, γ_k(n) is the a posteriori signal-to-noise ratio of the nth frame, max[γ_k(n) - 1, 0] selects the larger of γ_k(n) - 1 and 0, and a is a weighting factor.
In some embodiments, after the a priori and a posteriori signal-to-noise ratios have been estimated, the minimum mean square error of the log-spectral-amplitude estimate is taken as the distortion measure criterion, and the gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression is obtained by statistical derivation. In some embodiments, the gain function of the nth frame of frequency domain framing data of the audio signal of the ith channel before and after noise suppression is:
G(ξ_k(n), v_k(n)) = (ξ_k(n) / (1 + ξ_k(n))) · exp((1/2) ∫_{v_k(n)}^{∞} (e^{-t} / t) dt)    (10)

where G(ξ_k(n), v_k(n)) is the gain function of the nth frame of frequency domain framing data of the audio signal of the ith channel received by the microphone array before and after noise suppression, ξ_k(n) is the a priori signal-to-noise ratio of that frame, γ_k(n) is its a posteriori signal-to-noise ratio, and v_k(n) = ξ_k(n) γ_k(n) / (1 + ξ_k(n)).
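A sketch of equations (8)-(10) for one channel and one frame; the exponential integral in (10) is available as scipy.special.exp1, while the weighting factor a = 0.98 and the small floor on v_k(n) are implementation assumptions:

```python
import numpy as np
from scipy.special import exp1  # exp1(v) = integral from v to inf of e^-t / t dt

def mmse_lsa_gain(Y_cur, X_prev, noise_cov, noise_cov_prev, a=0.98):
    """Per-bin posterior SNR (8), decision-directed prior SNR (9),
    and the MMSE log-spectral-amplitude gain (10)."""
    gamma = np.abs(Y_cur) ** 2 / noise_cov                    # (8)
    xi = (a * np.abs(X_prev) ** 2 / noise_cov_prev
          + (1.0 - a) * np.maximum(gamma - 1.0, 0.0))         # (9)
    v = np.maximum(xi * gamma / (1.0 + xi), 1e-8)             # floor avoids exp1(0)
    return (xi / (1.0 + xi)) * np.exp(0.5 * exp1(v))          # (10)
```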
in step S330, the enhanced frequency-domain framing data of the current frame of the frequency-domain framing data of the plurality of channels after noise suppression is reconstructed based on the amplitude values of the frequency-domain framing data of the plurality of channels of the current frame and the gain function.
In some embodiments, based on the nth frame of frequency domain framing data Y(k,n) of the audio signals of the plurality of channels received by the microphone array and the gain function G(ξ_k(n), v_k(n)) of the nth frame of frequency domain framing data of the audio signal of the ith channel before and after noise suppression, the amplitude values of the reconstructed nth frame of enhanced frequency domain framing data of the plurality of channels after noise suppression are calculated as:

X_{k,n} = G(ξ_k(n), v_k(n)) · Y_{k,n}    (11)

where X_{k,n} is the amplitude value of the nth frame of enhanced frequency domain framing data of the audio signals of the plurality of channels received by the microphone array after noise suppression, G(ξ_k(n), v_k(n)) is the gain function of the nth frame of frequency domain framing data of the audio signal of the ith channel before and after noise suppression, and Y_{k,n} is the amplitude value of the nth frame of frequency domain framing data of the audio signals of the plurality of channels received by the microphone array.
In some embodiments, the reconstructed nth frame of enhanced frequency domain framing data of the audio signals of the plurality of channels after noise suppression is calculated as:

X(ω_{k,n}) = X_{k,n} · e^{jθ'(k,n)}    (12)

where X(ω_{k,n}) is the nth frame of enhanced frequency domain framing data of the audio signals of the plurality of channels received by the microphone array after noise suppression, X_{k,n} is its amplitude value, and θ'(k,n) is the corresponding phase term (the phase of the noisy frequency domain framing data).
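A sketch of equations (11) and (12): the single-channel gain scales the magnitudes of all channels, and each channel is recombined with its own noisy phase (taking θ'(k, n) as the noisy phase is the assumption here; all names are hypothetical):

```python
import numpy as np

def reconstruct_enhanced(Y_all, G):
    """Y_all: noisy frequency domain framing data, shape (m, n_freqs);
    G: single-channel gain function, shape (n_freqs,)."""
    magnitude = G * np.abs(Y_all)          # (11): X_{k,n} = G * Y_{k,n}
    phase = np.angle(Y_all)                # assumed theta'(k, n)
    return magnitude * np.exp(1j * phase)  # (12)
```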
In step S240, a spectral peak search is performed using an adaptive beamforming algorithm to estimate the direction of arrival of the sound source frame by frame based on the enhanced frequency domain framing data.
Beamforming performs weighted summation on the audio signals received by each array element and steers the receiving direction of the microphone array to a specified sound source direction according to a preset angle, thereby realizing directional selection of audio signals: the audio signal in the specified direction is enhanced while interfering sound sources and noise in other directions are suppressed. In the embodiments of the disclosure, an adaptive beamforming (Adaptive Beam Forming, ABF) algorithm (e.g., the minimum variance distortionless response beamforming algorithm) is used to estimate the direction of arrival of the sound source frame by frame.
In some embodiments, when beamforming is performed in the desired direction according to a preset angle with an adaptive beamforming algorithm, the weighting coefficient of each array element's output signal is not fixed but is adjusted automatically according to changes in the environment and the signal, so that the formed beam always points at the preset angle, suppressing interference sources and noise while enhancing the useful signal within the beam. Adaptive beamforming algorithms include, for example, algorithms based on the signal-to-noise ratio (Signal Noise Ratio, SNR) maximization criterion, the minimum mean square error (Minimum Mean Square Error, MMSE) criterion, and the linearly constrained minimum variance (Linearly Constraint Minimum Variance, LCMV) criterion. The specific adaptive beamforming algorithm employed by the embodiments of the present disclosure is not particularly limited. The following describes in detail, as an example, estimating the direction of arrival of the sound source frame by frame with the minimum variance distortionless response beamforming algorithm.
Fig. 4 shows a schematic flow chart of estimating the direction of arrival of a sound source frame by frame using a minimum variance distortion-free response beamforming algorithm provided according to an embodiment of the present disclosure. Referring to fig. 4, a method of estimating a direction of arrival of a sound source frame by frame using a minimum variance distortion-free response beamforming algorithm provided by an embodiment of the present disclosure includes steps S410 to S430.
In step S410, the enhanced frequency domain framing data of the current frame of the plurality of channels is used as the input of the minimum variance distortionless response beamforming algorithm, and the optimal weight vector of the current frame of the algorithm is solved with the Lagrange multiplier method under the condition that the constraint conditions are met. The constraint conditions are that the signal from the desired incidence direction passes through completely and that interference sources and noise in other directions are suppressed to the greatest extent.
In some embodiments, in order to obtain the audio signal in the desired direction while suppressing interfering sound sources and noise in other directions, the weight vector of the minimum variance distortionless response beamforming algorithm must be designed following the idea of adaptive beamforming. In some embodiments, the nth frame of enhanced frequency domain framing data of the audio signals of the plurality of channels after noise suppression is weighted to obtain the nth frame output beam of the minimum variance distortionless response beamforming algorithm:

Z(ω_{k,n}) = ω^H X(ω_{k,n}) = ω^H [a(θ) S(ω_{k,n}) + D(ω_{k,n})]    (13)

where a(θ) is the steering vector of the microphone array, which can be expressed as a(θ) = [1, e^{-jωτ_1}, …, e^{-jωτ_{m-1}}]^T, X(ω_{k,n}) is the nth frame of enhanced frequency domain framing data of the audio signals of the plurality of channels received by the microphone array after noise suppression, S(ω_{k,n}) is the nth frame of frequency domain framing data of the clean speech signals of the plurality of channels after noise suppression, D(ω_{k,n}) is the nth frame of frequency domain data of the noise, ω is the weight vector, and m is the number of microphone array elements.
The power of the nth frame output beam of the minimum variance distortionless response beamforming algorithm can then be calculated from equation (13) as:

P(θ) = E[|Z(ω_{k,n})|²] = ω^H R ω    (14)

where P(θ) is the power of the nth frame output beam of the minimum variance distortionless response beamforming algorithm, a(θ) is the steering vector of the microphone array, ω is the weight vector, and R = X(ω_{k,n}) X(ω_{k,n})^H is the autocorrelation matrix, with X(ω_{k,n}) the nth frame of enhanced frequency domain framing data of the plurality of channels after noise suppression.
Assuming that the desired incidence direction is θ, constraint (1) must be satisfied so that the signal from the desired incidence direction θ passes through completely: ω^H a(θ) = 1. On that basis, to suppress noise and interference signals in other directions to the greatest extent, constraint (2) must also be satisfied: P(θ) is minimized. The optimal weight vector of the nth frame of the minimum variance distortionless response beamforming algorithm can then be solved with the Lagrange multiplier method:

ω_opt = R⁻¹ a(θ) / (a^H(θ) R⁻¹ a(θ))    (15)

where ω_opt is the optimal weight vector of the nth frame of the minimum variance distortionless response beamforming algorithm, a(θ) is the steering vector of the microphone array, and R is the autocorrelation matrix.
In step S420, an average power of an output beam of the current frame of the minimum variance distortion-free response beamforming algorithm is calculated based on the optimal weight vector of the current frame and the enhanced frequency domain framing data of the current frame of the plurality of channels.
In some embodiments, substituting the optimal weight vector ω_opt of the nth frame of the minimum variance distortionless response beamforming algorithm into equation (14), the average power of the nth frame output beam can be calculated:

P'(θ) = 1 / (a^H(θ) R⁻¹ a(θ))    (16)

where P'(θ) is the average power of the nth frame output beam of the minimum variance distortionless response beamforming algorithm, a(θ) is the steering vector of the microphone array, and R is the autocorrelation matrix.
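A sketch of equations (15)-(16) for one frequency bin of a uniform linear array; the steering-vector form, the element spacing d, the sound speed c, the short block of frames used to form an invertible R (the text forms R from the current frame), and the diagonal loading term are all implementation assumptions:

```python
import numpy as np

def mvdr_spectrum(X, angles_deg, freq, d=0.05, c=343.0):
    """X: enhanced frequency domain data of one bin, shape (m, n_frames).
    Returns P'(theta) of equation (16) at each scan angle."""
    m = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                    # autocorrelation matrix R
    R += 1e-6 * np.real(np.trace(R)) / m * np.eye(m)   # diagonal loading (assumption)
    R_inv = np.linalg.inv(R)
    P = np.empty(len(angles_deg))
    for idx, theta in enumerate(np.deg2rad(angles_deg)):
        tau = d * np.arange(m) * np.sin(theta) / c     # inter-element delays
        a = np.exp(-2j * np.pi * freq * tau)           # steering vector a(theta)
        P[idx] = 1.0 / np.real(a.conj() @ R_inv @ a)   # (16)
    return P
```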
In step S430, in the angle scanning range, a spectrum peak search is performed on the average power of the output beam of the current frame, and the signal incident angle corresponding to the peak point is used as the estimated direction of arrival of the sound source of the current frame.
In some embodiments, an angle scanning range is determined according to the topology of the microphone array and the actual environment; then the average power P'(θ) of the output beam of the current frame of the minimum variance distortionless response beamforming algorithm is computed successively over the range with a certain step size, a spectral peak search is performed, and the signal incidence angle θ corresponding to the peak point is taken as the estimated direction of arrival of the sound source of the current frame.
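The search of step S430 then reduces to evaluating P'(θ) over the scan grid and taking the angle at the maximum; the grid, step size, bin frequency, and synthetic data below are assumptions, and mvdr_spectrum is the sketch above:

```python
import numpy as np

# synthetic stand-in for one bin of enhanced data: 4 mics, 32 frames
X_bin = np.random.randn(4, 32) + 1j * np.random.randn(4, 32)

angles = np.arange(-90.0, 91.0, 1.0)            # assumed scan range, 1-degree step
P = mvdr_spectrum(X_bin, angles, freq=1000.0)   # spatial spectrum P'(theta)
doa_estimate = angles[np.argmax(P)]             # incidence angle at the spectral peak
```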
When environmental noise is strong, the minimum variance distortionless response beamforming algorithm is prone to misjudgment, so the direction of arrival estimated frame by frame may deviate severely from the actual sound source position. Thus, in some embodiments, after the spectral peak search has been performed on the enhanced frequency domain framing data of the current frame of the plurality of channels to estimate the direction of arrival of the sound source frame by frame, the estimated direction of arrival of the current frame is smoothed based on the voice activity detection result. In some embodiments, a voice activity detection technique determines whether voice is present in the time domain framing data of the current frame: if voice is present, the estimated direction of arrival of the sound source of the current frame is taken as the final direction of arrival of the current frame; if no voice is present, the estimated direction of arrival of the sound source of the previous frame is taken as the final direction of arrival of the current frame. Smoothing the estimate in this way based on the voice activity detection result effectively avoids abnormal jumps in the estimated direction of arrival caused by noise and improves its accuracy.
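The smoothing rule described above is a simple VAD-gated hold, sketched below with hypothetical names:

```python
def smooth_doa(doa_cur, doa_prev, vad_flag):
    """Keep the current estimate on speech frames (vad_flag == 1);
    hold the previous frame's direction of arrival on noise frames."""
    return doa_cur if vad_flag else doa_prev
```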
Further, an embodiment of the present disclosure also discloses a sound source localization apparatus for implementing the aforementioned sound source localization method. Referring to fig. 5, a sound source localization apparatus 500 disclosed in an embodiment of the present disclosure includes: a framing and windowing processing unit 510, a short-time Fourier transform unit 520, a frequency domain framing data enhancement unit 530, a direction of arrival estimation unit 540, and a voice activity detection unit 550.
The framing and windowing processing unit 510 is configured to obtain time domain framing data by performing framing and windowing processing on the time domain data of the audio signal received by the microphone array. The short-time Fourier transform unit 520 is configured to obtain frequency domain framing data by performing a short-time Fourier transform on the time domain framing data. The frequency domain framing data enhancement unit 530 is configured to reconstruct enhanced frequency domain framing data by noise-suppressing the frequency domain framing data frame by frame. The direction of arrival estimation unit 540 is configured to perform a spectral peak search using an adaptive beamforming algorithm to estimate the direction of arrival of the sound source frame by frame based on the enhanced frequency domain framing data. The voice activity detection unit 550 is configured to perform voice activity detection on the time domain framing data frame by frame using a voice activity detection technique, obtaining a voice activity detection result for each frame, from which the presence of voice in that frame is determined.
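Purely as an illustration of how these units compose per frame (the call order only; all names are placeholders, not the patent's API), the apparatus can be viewed as the following pipeline.

```python
def localize_frame(frame_time, units, state):
    """One pass through apparatus 500 for a single multi-channel frame.

    `units` is any object exposing window(), stft(), vad(), enhance() and
    estimate_doa(): placeholder callables standing in for units
    510/520/550/530/540; `state` carries recursive quantities such as the
    noise estimate and the previous frame's direction of arrival.
    """
    windowed = units.window(frame_time)                          # unit 510
    spectra = units.stft(windowed)                               # unit 520
    has_speech = units.vad(frame_time)                           # unit 550
    enhanced, state = units.enhance(spectra, has_speech, state)  # unit 530
    doa = units.estimate_doa(enhanced)                           # unit 540
    return doa, has_speech, state
```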
In specific implementations, the modules/units of the sound source localization apparatus may each be implemented as a separate entity, or may be combined arbitrarily and implemented as one or more entities. For the specific implementation of each module/unit, reference may be made to the foregoing method embodiments, which will not be repeated here.
An embodiment of the present disclosure also provides a sound source localization system for implementing the aforementioned sound source localization method. The system includes a microphone array configured to receive time domain data of an audio signal from a sound source, and a sound source localization device configured to perform the aforementioned sound source localization method. For the specific implementation of each module/unit in the system, reference may be made to the foregoing method embodiments, which will not be repeated here.
An embodiment of the present disclosure further provides an electronic device, as shown in fig. 6, including a memory 620, a processor 610, and a program stored in the memory 620 and executable on the processor 610. When executed by the processor 610, the program implements each process of the embodiments of the sound source localization method above and achieves the same technical effects; to avoid repetition, these are not described again here.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments may be completed by instructions, or by instructions controlling associated hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, embodiments of the present disclosure also provide a storage medium storing a computer program or instructions which, when executed by a processor, implement the respective processes of the embodiments of the sound source localization method above.
Because the instructions stored in the storage medium can execute the steps of the sound source localization method provided by the embodiments of the present disclosure, they can achieve the beneficial effects achievable by that method, as detailed in the previous embodiments and not repeated here. Likewise, the specific implementation of each operation above may be found in the previous embodiments.
In summary, according to an embodiment of the present disclosure, time domain framing data are obtained by framing and windowing the time domain data of the audio signal received by the microphone array; frequency domain framing data are obtained by short-time Fourier transform of the time domain framing data; enhanced frequency domain framing data are reconstructed by noise-suppressing the frequency domain framing data frame by frame; and the direction of arrival of the sound source is estimated frame by frame by a spectral peak search with an adaptive beamforming algorithm based on the enhanced frequency domain framing data. In this way, before sound source localization is performed, the frequency domain framing data of the multi-channel audio signals received by the microphone array are preprocessed by noise suppression and the enhanced, noise-suppressed frequency domain framing data are reconstructed, which raises the signal-to-noise ratio of the received audio signal and thereby improves the accuracy of sound source localization.
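For concreteness, a minimal single-channel sketch of the first two steps (Hann-windowed framing followed by a short-time Fourier transform) is given below; the frame length and hop size are illustrative choices, not values taken from the disclosure.

```python
import numpy as np

def stft_frames(x, frame_len=512, hop=256):
    """Hann-windowed framing plus one-sided FFT of a 1-D signal.

    Returns an array of shape (n_frames, frame_len // 2 + 1); assumes
    len(x) >= frame_len.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)
```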
Because the enhanced frequency domain framing data of the current frame of the plurality of channels serve directly as the input of the adaptive beamforming algorithm for the frame-by-frame spectral peak search, only one short-time Fourier transform of the multi-channel time domain framing data is needed to estimate the direction of arrival, which reduces the amount of computation and improves the real-time performance of sound source localization.
By calculating the estimated noise covariance of the frequency domain framing data of the current frame of the first channel, using it to compute the gain function of that data before and after noise suppression, and reconstructing the noise-suppressed enhanced frequency domain framing data of all channels from the per-channel magnitudes of the current frame and this gain function, noise suppression and the gain computation need to be performed for only one channel. The resulting gain function is reused as the gain function of all channels to reconstruct their enhanced frequency domain framing data, which reduces the amount of computation and improves the real-time performance of sound source localization.
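A hedged sketch of this shared-gain reconstruction follows; applying the first channel's gain to every channel's magnitude while keeping each channel's own phase is an assumption consistent with the magnitude-based reconstruction described above, and all names are illustrative.

```python
import numpy as np

def apply_shared_gain(spectra, gain):
    """spectra: (n_channels, n_bins) complex STFT of the current frame;
    gain: (n_bins,) real-valued gain computed from the first channel."""
    magnitudes = np.abs(spectra) * gain    # enhanced magnitudes, all channels
    phases = np.angle(spectra)             # each channel keeps its own phase
    return magnitudes * np.exp(1j * phases)
```

Keeping each channel's phase intact matters here, since the inter-channel phase differences are precisely what the subsequent direction-of-arrival estimate relies on.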
After the voice activity detection result of each frame of time domain framing data is obtained, the frequency domain representation of the current frame of the first channel among the plurality of channels received by the microphone array is determined based on that result, and the estimated noise covariance of the current frame of the first channel is computed from this representation. Using the voice activity detection result in the noise estimation process effectively reduces the estimation error of the noise covariance and further improves the accuracy of sound source localization. In addition, smoothing the estimated direction of arrival of the sound source of the current frame based on the voice activity detection result effectively avoids abnormal jumps in the estimate caused by noise, improving the accuracy of the estimated direction of arrival and thus of sound source localization.
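The VAD-gated weighted-sum noise update described above can be sketched as the following per-bin recursion; the smoothing factor alpha is an illustrative choice, not a value from the disclosure.

```python
import numpy as np

def update_noise_power(noise_prev, frame_spectrum, has_speech, alpha=0.9):
    """Per-bin noise power estimate for the first channel.

    noise_prev: (n_bins,) previous estimate; frame_spectrum: (n_bins,)
    complex spectrum of the current frame of the first channel.
    """
    if has_speech:
        return noise_prev                  # hold the estimate during speech
    power = np.abs(frame_spectrum) ** 2    # current frame power per bin
    return alpha * noise_prev + (1 - alpha) * power
```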
Finally, it should be noted that the above embodiments merely illustrate the present disclosure and do not limit it. Other variations or modifications will be apparent to those of ordinary skill in the art from the above teachings; it is neither necessary nor possible to exhaustively list all embodiments here, and obvious variations or modifications thereof fall within the scope of the present disclosure.

Claims (13)

1. A sound source localization method, comprising:
obtaining time domain framing data by performing framing and windowing processing on time domain data of an audio signal received by a microphone array;
obtaining frequency domain framing data by performing short-time Fourier transform on the time domain framing data;
reconstructing enhanced frequency domain framing data of the frequency domain framing data by performing noise suppression on the frequency domain framing data frame by frame;
and based on the enhanced frequency domain framing data, performing spectral peak search by utilizing an adaptive beam forming algorithm to estimate the direction of arrival of the sound source frame by frame.
2. The sound source localization method of claim 1, wherein the frequency domain framing data comprises frequency domain framing data of a plurality of channels, and wherein reconstructing enhanced frequency domain framing data of the frequency domain framing data by noise suppressing the frequency domain framing data on a frame-by-frame basis comprises:
calculating an estimated noise covariance of frequency domain framing data of a current frame of a first channel of the plurality of channels;
calculating a gain function of frequency domain framing data of the current frame of the first channel before and after noise suppression by using the estimated noise covariance;
reconstructing the enhanced frequency domain framing data of the current frame of the frequency domain framing data of the plurality of channels after noise suppression based on the amplitude values of the frequency domain framing data of the plurality of channels of the current frame and the gain function.
3. The sound source localization method according to claim 2, wherein after the time-domain frame data is obtained by performing a frame-wise windowing process on the time-domain data of the audio signal received by the microphone array, the sound source localization method further comprises:
and carrying out voice activity detection on the time domain frame data frame by utilizing a voice activity detection technology to obtain a voice activity detection result of the time domain frame data of each frame, wherein whether voice exists in the time domain frame data of each frame is determined according to the voice activity detection result.
4. A sound source localization method as claimed in claim 3, wherein said calculating an estimated noise covariance of frequency domain framing data of a current frame of a first channel of the plurality of channels comprises:
determining a frequency domain representation of frequency domain framing data of a current frame of the first channel based on the voice activity detection result;
calculating estimated noise covariance of the frequency domain framing data of the current frame of the first channel according to the frequency domain representation of the frequency domain framing data of the current frame of the first channel;
and if the current frame of the first channel does not have voice, taking the weighted sum of the estimated noise covariance of the frequency domain framing data of the previous frame of the first channel and the power of the frequency domain framing data of the current frame of the first channel as the estimated noise covariance of the frequency domain framing data of the current frame of the first channel.
5. The sound source localization method of claim 4, wherein calculating the gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression using the estimated noise covariance comprises:
calculating a posterior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel based on the estimated noise covariance and the power of the frequency domain framing data of the current frame of the first channel;
taking a weighted average of the prior signal-to-noise ratio of the frequency domain framing data of the previous frame of the first channel and the posterior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel as the prior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel based on a decision-directed estimation mode;
and calculating a gain function of the frequency domain framing data of the current frame of the first channel before and after noise suppression by taking the minimum mean square error estimated by the spectrum amplitude as a distortion measurement criterion based on the posterior signal-to-noise ratio and the prior signal-to-noise ratio of the frequency domain framing data of the current frame of the first channel.
6. The sound source localization method of claim 4, wherein the adaptive beamforming algorithm comprises a minimum variance distortionless response beamforming algorithm, and wherein performing a spectral peak search to estimate the direction of arrival of the sound source frame by frame using the adaptive beamforming algorithm based on the enhanced frequency domain framing data comprises:
taking the enhanced frequency domain framing data of the current frame of the plurality of channels as the input of the minimum variance distortionless response beamforming algorithm, and solving the optimal weight vector of the current frame of the minimum variance distortionless response beamforming algorithm by using the Lagrange multiplier method under the condition that constraint conditions are met;
calculating an average power of an output beam of the current frame of the minimum variance distortionless response beamforming algorithm based on the optimal weight vector of the current frame and the enhanced frequency domain framing data of the current frame of the plurality of channels;
and in the angle scanning range, carrying out spectrum peak search on the average power of the output beam of the current frame, and taking the signal incidence angle corresponding to the peak point as the estimated direction of arrival of the sound source of the current frame.
7. The sound source localization method of claim 6, wherein the constraint conditions comprise passing the signal in the desired incidence direction without distortion and maximally suppressing interference sources and noise in other directions.
8. The sound source localization method of claim 6, wherein the sound source localization method further comprises, after performing a spectral peak search to estimate a direction of arrival of a sound source frame by frame using an adaptive beamforming algorithm based on the enhanced frequency domain framing data:
and smoothing the estimated direction of arrival of the sound source of the current frame based on the voice activity detection result, wherein if voice exists in the time domain framing data of the current frame, the estimated direction of arrival of the sound source of the current frame is determined to be the final direction of arrival of the sound source of the current frame, and if voice does not exist in the time domain framing data of the current frame, the estimated direction of arrival of the sound source of the previous frame is determined to be the final direction of arrival of the sound source of the current frame.
9. A sound source localization apparatus, comprising:
a framing and windowing processing unit configured to obtain time domain framing data by performing framing and windowing processing on time domain data of an audio signal received by a microphone array;
a short-time Fourier transform unit configured to obtain frequency domain framing data by performing short-time Fourier transform on the time domain framing data;
a frequency domain framing data enhancement unit configured to reconstruct enhanced frequency domain framing data of the frequency domain framing data by performing noise suppression on the frequency domain framing data frame by frame;
and a direction of arrival estimation unit configured to perform a spectral peak search using an adaptive beamforming algorithm to estimate a direction of arrival of a sound source frame by frame based on the enhanced frequency domain framing data.
10. The sound source localization device of claim 9, further comprising:
and the voice activity detection unit is configured to detect voice activity of the time domain frame data frame by utilizing a voice activity detection technology to obtain a voice activity detection result of the time domain frame data of each frame, wherein whether voice exists in the time domain frame data of each frame is determined according to the voice activity detection result.
11. A sound source localization system, comprising:
a microphone array configured to receive time domain data of an audio signal from a sound source;
a sound source localization device configured to perform the method of any one of claims 1 to 8.
12. An electronic device, comprising: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor, performs the steps of the method according to any one of claims 1 to 8.
13. A storage medium having stored thereon a computer program or instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 8.
CN202211626758.7A 2022-12-16 2022-12-16 Sound source positioning method, related device and medium Pending CN116106826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211626758.7A CN116106826A (en) 2022-12-16 2022-12-16 Sound source positioning method, related device and medium

Publications (1)

Publication Number Publication Date
CN116106826A true CN116106826A (en) 2023-05-12

Family

ID=86257170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211626758.7A Pending CN116106826A (en) 2022-12-16 2022-12-16 Sound source positioning method, related device and medium

Country Status (1)

Country Link
CN (1) CN116106826A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117289208A (en) * 2023-11-24 2023-12-26 北京瑞森新谱科技股份有限公司 Sound source positioning method and device
CN117289208B (en) * 2023-11-24 2024-02-20 北京瑞森新谱科技股份有限公司 Sound source positioning method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination