CN117169812A - Sound source positioning method based on deep learning and beam forming - Google Patents
Sound source positioning method based on deep learning and beam forming
- Publication number
- CN117169812A (application CN202311134116.XA)
- Authority
- CN
- China
- Prior art keywords
- audio signal
- sound source
- beam forming
- microphone array
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention relates to a sound source positioning method based on deep learning and beam forming, and belongs to the technical field of sound source positioning. First, four microphones are arranged at fixed intervals, as required, to form a microphone array, and the microphone array is used to acquire audio. The original audio is then converted into a Mel spectrum for feature extraction, a TDNN network classifies the audio, and finally a beam forming method performs sound source localization on the multichannel audio according to the audio classification result. By combining deep learning with beam forming, the invention reduces the required data-labeling cost and allows the target of sound source localization to be changed flexibly.
Description
Technical Field
The invention relates to a sound source positioning method based on deep learning and beam forming, and belongs to the technical field of sound source positioning.
Background
Sound source localization refers to determining the location from which a sound originates. In many applications, such as intelligent audio monitoring, robot navigation, and speech enhancement, accurately locating the sound source is critical to providing a better user experience and performance. Beamforming is a technique that enhances the signal from a particular direction by suitably combining the signals received by an array of sensors. It is commonly applied to microphone arrays: by weighting and phase adjustment, it enhances the sound source signal in the direction of interest while suppressing interference from other directions. Conventional sound source localization methods estimate the direction of the sound source using signal processing techniques and sensor arrays, for example cross-correlation functions, beam differencing, and least squares. However, these methods may suffer from noise, multipath effects, and signal strength variations, which limit their positioning accuracy.
Deep learning is a machine learning technique whose core is to learn representations and features of data by building and training deep neural networks. In the sound source localization problem, deep learning can help to automatically learn complex sound source characteristics and improve the estimation of the sound source position. In recent years, researchers have begun to apply deep learning to sound source localization: with a deep learning algorithm, a more effective representation of sound source characteristics can be learned from a large amount of data, thereby improving the accuracy and robustness of localization. The sound source positioning method based on deep learning and beam forming provides an effective solution to the sound source localization problem by combining the strong feature learning capability of deep learning with the directional enhancement capability of beamforming.
Disclosure of Invention
The invention aims to provide a sound source positioning method based on deep learning and beam forming that addresses the problems that labeling a data set for sound source localization is expensive and that each specific scene requires its own labeled data set, thereby increasing the flexibility of sound source localization applications and reducing cost.
The technical scheme of the invention is as follows: a sound source localization method based on deep learning and beam forming comprises the following specific steps:
step1: and 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used for acquiring an original audio signal.
The Step1 specifically comprises the following steps:
step1.1: four specific positions are selected to place the microphone under a three-dimensional rectangular coordinate system.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively.
The positions of these microphones constitute a (0, 0) -centered microphone array, where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel raw audio signal is acquired using a microphone array.
Step1.3: noise and interference in the environment are removed by using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the original audio signal after noise reduction is amplified, and the energy of the original audio signal is enhanced, so that the signal is more obvious and easy to distinguish.
Step2: and converting the acquired original audio signal into a Mel frequency spectrum for feature extraction.
The Step2 specifically comprises the following steps:
step2.1: the acquired original audio signal is loaded and resampled to ensure its sampling rate at 16000Hz.
Step2.2: and carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard.
Step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of the window.
Step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast fourier transformation and overlapping windowing, the frequency spectrum of the audio signal is obtained.
Step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
step3: and classifying the extracted audio signals by using a TDNN network.
The Step3 specifically comprises the following steps:
step3.1: and classifying the audio signals after the features are extracted by using a TDNN network.
Step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
Step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
The Step4 specifically comprises the following steps:
step4.1: a sound field grid is created based on the RectGrid function in the Aculolar library to represent the possible range of sound source locations.
Step4.2: and reading the multichannel audio signals acquired by the microphone array from the designated path, and obtaining information such as the sampling rate of the audio signals.
Step4.3: and converting the multichannel audio signals acquired by the microphone array into an h5 format.
Step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the i-th frame of signal x.
Step4.5: microphone positions are plotted and position information of the microphone array is plotted on a graph.
Step4.6: and obtaining a sound field Environment based on the Environment function in the aclar library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref And the reference sound pressure is used as a reference.
Step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
The beneficial effects of the invention are as follows: the invention acquires sound source information with a microphone array, uses a TDNN network to identify whether the sound is the target to be localized, and then gives the position information of the sound source. Compared with the prior art, it mainly addresses the problems that labeling a data set for sound source localization is expensive and that each specific scene requires its own labeled data set, thereby increasing the flexibility of sound source localization applications and reducing cost.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a microphone array position diagram of the present invention;
fig. 3 is a position information diagram of the sound source localization of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Example 1: as shown in fig. 1, a sound source localization method based on deep learning and beam forming is characterized in that:
step1: and 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used for acquiring an original audio signal.
Step2: and converting the acquired original audio signal into a Mel frequency spectrum for feature extraction.
Step3: and classifying the extracted audio signals by using a TDNN network.
Step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
In practical applications the required sound source localization scenes differ, so a suitable microphone array must be configured according to the specific requirements; the microphone positions are shown in fig. 2. In this embodiment, Step1 is specifically:
step1.1: four specific positions are selected to place the microphone under a three-dimensional rectangular coordinate system.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively.
The positions of these microphones constitute a (0, 0) -centered microphone array, where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel raw audio signal is acquired using a microphone array.
Step1.3: noise and interference in the environment are removed by using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the original audio signal after noise reduction is amplified, and the energy of the original audio signal is enhanced, so that the signal is more obvious and easy to distinguish.
Because the original audio signal is a time-domain waveform containing a great deal of detail and noise, it is not well suited to being fed directly into a neural network. By converting the audio signal to the frequency domain, the Mel spectrum represents the speech features in audio better: it decomposes the audio into the energy at a series of specific frequencies and corresponds more closely to human auditory perception. Therefore, in this embodiment, Step2 is specifically:
step2.1: the acquired original audio signal is loaded and resampled to ensure its sampling rate at 16000Hz.
Step2.2: and carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard.
Step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of the window.
Step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast fourier transformation and overlapping windowing, the frequency spectrum of the audio signal is obtained.
Step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
since the audio signal is a time domain signal, it has temporal sequence. The TDNN network takes time delay into consideration in design, and features in different time ranges are captured through convolution kernels of different time steps, so that timing information can be effectively processed. This enables the TDNN network to better understand and utilize contextual information in audio, which is very useful for audio classification tasks. Therefore, in this embodiment, step3 is specifically:
step3.1: and classifying the audio signals after the features are extracted by using a TDNN network.
Step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
The Step4 specifically comprises the following steps:
step4.1: a sound field grid is created based on the RectGrid function in the Aculolar library to represent the possible range of sound source locations.
Step4.2: and reading the multichannel audio signals acquired by the microphone array from the designated path, and obtaining information such as the sampling rate of the audio signals.
Step4.3: and converting the multichannel audio signals acquired by the microphone array into an h5 format.
Step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the i-th frame of signal x.
Step4.5: microphone positions are plotted and position information of the microphone array is plotted on a graph.
Step4.6: and obtaining a sound field Environment based on the Environment function in the aclar library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref And the reference sound pressure is used as a reference.
Step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
While the present invention has been described in detail with reference to the drawings, it is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.
Step2.5: after fast Fourier transformation and overlapping windowing treatment, obtaining the frequency spectrum of the audio signal;
step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
4. the sound source localization method based on deep learning and beam forming according to claim 1, wherein Step3 is specifically:
step3.1: classifying the audio signals after the characteristics are extracted by using a TDNN network;
step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
5. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step4 is specifically:
step4.1: creating a sound field grid based on a RectGrid function in an acicular library, wherein the sound field grid is used for representing a possible range of a sound source position;
step4.2: reading multichannel audio signals acquired by the microphone array from a designated path, and obtaining the sampling rate of the audio signals;
step4.3: converting the multichannel audio signals acquired by the microphone array into an h5 format;
step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the ith frame of signal x;
step4.5: drawing microphone positions, and drawing position information of a microphone array on a graph;
step4.6: obtaining a sound field Environment based on an Environment function in an aclar library;
step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref A reference sound pressure is used as a reference;
step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
Claims (5)
1. A sound source positioning method based on deep learning and beam forming is characterized in that:
step1: forming 4 microphones into a microphone array at fixed intervals according to requirements, and acquiring an original audio signal by using the microphone array;
step2: converting the acquired original audio signal into a Mel frequency spectrum for feature extraction;
step3: classifying the extracted audio signals by using a TDNN network;
step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
2. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step1 is specifically:
step1.1: selecting four specific positions to place microphones in a three-dimensional rectangular coordinate system;
the specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively;
the positions of these microphones form a microphone array centered at the origin (0, 0, 0), where the parameter a is regarded as a fixed value;
step1.2: collecting 4-channel original audio signals by using a microphone array;
step1.3: removing noise and interference in the environment by using an adaptive filtering algorithm, and extracting a pure audio signal from an original audio signal;
step1.4: amplifying the original audio signal after noise reduction, and enhancing the energy of the original audio signal.
3. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step2 is specifically:
step2.1: loading the obtained original audio signal and resampling the audio signal to ensure that the sampling rate is 16000Hz;
step2.2: carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard;
step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of a window;
step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast Fourier transformation and overlapping windowing treatment, obtaining the frequency spectrum of the audio signal;
step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
4. the sound source localization method based on deep learning and beam forming according to claim 1, wherein Step3 is specifically:
step3.1: classifying the audio signals after the characteristics are extracted by using a TDNN network;
step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
5. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step4 is specifically:
step4.1: creating a sound field grid based on a RectGrid function in an acicular library, wherein the sound field grid is used for representing a possible range of a sound source position;
step4.2: reading multichannel audio signals acquired by the microphone array from a designated path, and obtaining the sampling rate of the audio signals;
step4.3: converting the multichannel audio signals acquired by the microphone array into an h5 format;
step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the ith frame of signal x;
step4.5: drawing microphone positions, and drawing position information of a microphone array on a graph;
step4.6: obtaining a sound field Environment based on an Environment function in an aclar library;
step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref A reference sound pressure is used as a reference;
step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311134116.XA CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311134116.XA CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117169812A true CN117169812A (en) | 2023-12-05 |
Family
ID=88940642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311134116.XA Pending CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117169812A (en) |
Worldwide Applications (1)
- 2023-09-05 CN CN202311134116.XA patent/CN117169812A/en active Pending

Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117368847A * | 2023-12-07 | 2024-01-09 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network
CN117368847B * | 2023-12-07 | 2024-03-15 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |