CN117169812A - Sound source positioning method based on deep learning and beam forming


Info

Publication number
CN117169812A
CN117169812A (application CN202311134116.XA)
Authority
CN
China
Prior art keywords
audio signal
sound source
beam forming
microphone array
deep learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311134116.XA
Other languages
Chinese (zh)
Inventor
董明荣
杨宜璇
沈韬
曾凯
蔡云麒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202311134116.XA
Publication of CN117169812A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to a sound source localization method based on deep learning and beamforming, and belongs to the technical field of sound source localization. First, 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the array is used to acquire audio. The raw audio is then converted into a Mel spectrum for feature extraction, a TDNN network classifies the audio, and finally a beamforming method performs sound source localization on the multichannel audio according to the classification result. By combining deep learning with beamforming, the invention reduces the required data-labeling cost and allows the sound source localization target to be changed flexibly.

Description

Sound source positioning method based on deep learning and beam forming
Technical Field
The invention relates to a sound source positioning method based on deep learning and beam forming, and belongs to the technical field of sound source positioning.
Background
Sound source localization refers to determining the position from which a sound originates. In many applications, such as intelligent audio monitoring, robot navigation, and speech enhancement, accurately locating the sound source is critical to good user experience and performance. Beamforming is a technique that enhances the signal from a particular direction by suitably combining the signals received by an array of sensors. It is commonly applied to microphone arrays: by weighting and phase adjustment, it enhances the sound source signal in the direction of interest while suppressing interference from other directions. Conventional sound source localization methods estimate the direction of the sound source with signal processing techniques and sensor arrays, for example cross-correlation functions, beam differences, and least squares. However, these methods suffer from noise, multipath effects, and signal strength variations, which limit their positioning accuracy.
Deep learning is a machine learning technique whose core is to learn representations and features of data by building and training deep neural networks. In the sound source localization problem, deep learning can automatically learn complex sound source characteristics and improve the estimation of the sound source position. In recent years, researchers have therefore begun to apply deep learning to sound source localization: with a deep learning algorithm, a more effective representation of sound source characteristics can be learned from a large amount of data, improving the accuracy and robustness of localization. The sound source localization method based on deep learning and beam forming combines the strong feature learning capability of deep learning with the directional enhancement capability of beamforming, and provides an effective solution to the sound source localization problem.
Disclosure of Invention
The invention aims to provide a sound source localization method based on deep learning and beam forming that addresses the high labeling cost of sound source localization datasets and the need to label a dedicated dataset for each specific scene, thereby increasing the flexibility of sound source localization applications and reducing cost.
The technical scheme of the invention is as follows: a sound source localization method based on deep learning and beam forming comprises the following specific steps:
Step1: 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used to acquire an original audio signal.
The Step1 specifically comprises the following steps:
Step1.1: four specific positions are selected in a three-dimensional rectangular coordinate system for placing the microphones.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0) and (a, -a, 0), respectively.
The positions of these microphones constitute a microphone array centered at the origin (0, 0, 0), where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel original audio signal is acquired using the microphone array.
Step1.3: noise and interference in the environment are removed using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the noise-reduced audio signal is amplified to enhance its energy, making the signal more prominent and easier to distinguish.
Step2: and converting the acquired original audio signal into a Mel frequency spectrum for feature extraction.
The Step2 specifically comprises the following steps:
Step2.1: the acquired original audio signal is loaded and resampled to a sampling rate of 16000 Hz.
Step2.2: decibel normalization is applied to the resampled audio signal so that its volume lies in a consistent range.
Step2.3: an overlapping-window technique is applied: the audio signal is framed with a Hamming window function so that adjacent frames partially overlap. The Hamming window is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)) (1)
In formula (1), 0 ≤ n ≤ N-1, where N is the length of the window.
Step2.4: the input audio signal is subjected to a fast Fourier transform, mapping it from the time domain to the frequency domain:
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πkn/N}, k = 0, 1, ..., N-1 (2)
Step2.5: after the fast Fourier transform and overlapping-window processing, the frequency spectrum of the audio signal is obtained.
Step2.6: the frequency spectrum is converted to the Mel scale to obtain the Mel spectrum; the frequency-to-Mel conversion is:
Mel(f) = 2595·log10(1 + f/700) (3)
step3: and classifying the extracted audio signals by using a TDNN network.
The Step3 specifically comprises the following steps:
Step3.1: the audio signals after feature extraction are classified using a TDNN network.
Step3.2: the classification result is checked to determine whether it is the category to be localized, deciding whether to proceed to the localization operation.
Step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
The Step4 specifically comprises the following steps:
Step4.1: a sound field grid is created based on the RectGrid class in the Acoular library to represent the possible range of sound source positions.
Step4.2: the multichannel audio signals acquired by the microphone array are read from the designated path, and information such as the sampling rate of the audio signals is obtained.
Step4.3: the multichannel audio signals acquired by the microphone array are converted into HDF5 (.h5) format.
Step4.4: the audio signal is framed, each frame is windowed and transformed with a fast Fourier transform, and finally the power spectrum of each frame is calculated:
P_i(k) = |FFT(x_i)(k)|² / N (4)
In formula (4), x_i is the i-th frame of signal x and N is the frame length.
Step4.5: the microphone positions are plotted, drawing the position information of the microphone array on a graph.
Step4.6: a sound field environment is obtained based on the Environment class in the Acoular library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, the microphone array information, and the sound field grid.
Step4.8: beamforming is performed based on the frequency spectrum data and the steering vectors to obtain a beamforming result.
Step4.9: the sound pressure level of the beamforming result is calculated and displayed as a heat map, specifically:
SPL = 20·log10(P / P_ref) (5)
In formula (5), P is the sound pressure to be evaluated and P_ref is the reference sound pressure.
Step4.10: the sound source position information is obtained from the heat map of Step4.9.
The beneficial effects of the invention are as follows: the invention acquires sound source information with a microphone array, uses a TDNN network to identify whether the sound is the target to be localized, and then gives the position information of the sound source. Compared with the prior art, it mainly addresses the high labeling cost of sound source localization datasets and the need to label a dedicated dataset for each specific scene, increasing the flexibility of sound source localization applications and reducing cost.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a microphone array position diagram of the present invention;
FIG. 3 is a position information diagram of the sound source localization of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Example 1: as shown in fig. 1, a sound source localization method based on deep learning and beam forming comprises:
Step1: 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used to acquire an original audio signal.
Step2: the acquired original audio signal is converted into a Mel spectrum for feature extraction.
Step3: the extracted audio signals are classified using a TDNN network.
Step4: a beamforming method is further used to perform sound source localization on the multichannel audio signals according to the audio signal classification result.
In practical applications the required sound source localization scenes differ, so an appropriate microphone array needs to be configured according to the specific requirements; the specific positions of the microphones are shown in fig. 2. In this embodiment, Step1 is specifically:
Step1.1: four specific positions are selected in a three-dimensional rectangular coordinate system for placing the microphones.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0) and (a, -a, 0), respectively.
The positions of these microphones constitute a microphone array centered at the origin (0, 0, 0), where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel original audio signal is acquired using the microphone array.
Step1.3: noise and interference in the environment are removed using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the noise-reduced audio signal is amplified to enhance its energy, making the signal more prominent and easier to distinguish.
Because the original audio signal is a time-domain waveform containing a great deal of detail and noise, it is not well suited to direct input into a neural network. By converting the audio signal to the frequency domain, the Mel spectrum represents the speech features in audio better: it decomposes the audio into a series of energy spectra at specific frequencies and corresponds more closely to human auditory perception. Therefore, in this embodiment, Step2 is specifically:
Step2.1: the acquired original audio signal is loaded and resampled to a sampling rate of 16000 Hz.
Step2.2: decibel normalization is applied to the resampled audio signal so that its volume lies in a consistent range.
Step2.3: an overlapping-window technique is applied: the audio signal is framed with a Hamming window function so that adjacent frames partially overlap. The Hamming window is:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)) (1)
In formula (1), 0 ≤ n ≤ N-1, where N is the length of the window.
Step2.4: the input audio signal is subjected to a fast Fourier transform, mapping it from the time domain to the frequency domain:
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πkn/N}, k = 0, 1, ..., N-1 (2)
Step2.5: after the fast Fourier transform and overlapping-window processing, the frequency spectrum of the audio signal is obtained.
Step2.6: the frequency spectrum is converted to the Mel scale to obtain the Mel spectrum; the frequency-to-Mel conversion is:
Mel(f) = 2595·log10(1 + f/700) (3)
since the audio signal is a time domain signal, it has temporal sequence. The TDNN network takes time delay into consideration in design, and features in different time ranges are captured through convolution kernels of different time steps, so that timing information can be effectively processed. This enables the TDNN network to better understand and utilize contextual information in audio, which is very useful for audio classification tasks. Therefore, in this embodiment, step3 is specifically:
step3.1: and classifying the audio signals after the features are extracted by using a TDNN network.
Step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
The Step4 specifically comprises the following steps:
Step4.1: a sound field grid is created based on the RectGrid class in the Acoular library to represent the possible range of sound source positions.
Step4.2: the multichannel audio signals acquired by the microphone array are read from the designated path, and information such as the sampling rate of the audio signals is obtained.
Step4.3: the multichannel audio signals acquired by the microphone array are converted into HDF5 (.h5) format.
Step4.4: the audio signal is framed, each frame is windowed and transformed with a fast Fourier transform, and finally the power spectrum of each frame is calculated:
P_i(k) = |FFT(x_i)(k)|² / N (4)
In formula (4), x_i is the i-th frame of signal x and N is the frame length.
Step4.5: the microphone positions are plotted, drawing the position information of the microphone array on a graph.
Step4.6: a sound field environment is obtained based on the Environment class in the Acoular library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, the microphone array information, and the sound field grid.
Step4.8: beamforming is performed based on the frequency spectrum data and the steering vectors to obtain a beamforming result.
Step4.9: the sound pressure level of the beamforming result is calculated and displayed as a heat map, specifically:
SPL = 20·log10(P / P_ref) (5)
In formula (5), P is the sound pressure to be evaluated and P_ref is the reference sound pressure.
Step4.10: the sound source position information is obtained from the heat map of Step4.9.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims (5)

1. A sound source positioning method based on deep learning and beam forming is characterized in that:
step1: forming 4 microphones into a microphone array at fixed intervals according to requirements, and acquiring an original audio signal by using the microphone array;
step2: converting the acquired original audio signal into a Mel frequency spectrum for feature extraction;
step3: classifying the extracted audio signals by using a TDNN network;
step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
2. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step1 is specifically:
step1.1: selecting four specific positions to place microphones in a three-dimensional rectangular coordinate system;
the specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0) and (a, -a, 0), respectively;
the positions of these microphones constitute a microphone array centered at the origin (0, 0, 0), where parameter a is considered a fixed value;
step1.2: collecting 4-channel original audio signals by using a microphone array;
step1.3: removing noise and interference in the environment by using an adaptive filtering algorithm, and extracting a pure audio signal from an original audio signal;
step1.4: amplifying the original audio signal after noise reduction, and enhancing the energy of the original audio signal.
3. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step2 is specifically:
step2.1: loading the obtained original audio signal and resampling it so that the sampling rate is 16000 Hz;
step2.2: carrying out decibel normalization on the resampled audio signal so that the volume lies in a consistent range;
step2.3: applying an overlapping-window technique, framing the audio signal with a Hamming window function so that adjacent frames partially overlap, specifically:
w(n) = 0.54 - 0.46·cos(2πn/(N-1)) (1)
in formula (1), 0 ≤ n ≤ N-1, where N is the length of the window;
step2.4: subjecting the input audio signal to a fast Fourier transform, mapping it from the time domain to the frequency domain:
X(k) = Σ_{n=0}^{N-1} x(n)·e^{-j2πkn/N}, k = 0, 1, ..., N-1 (2)
step2.5: after the fast Fourier transform and overlapping-window processing, obtaining the frequency spectrum of the audio signal;
step2.6: converting the frequency spectrum into the Mel scale to obtain the Mel spectrum, wherein the frequency-to-Mel conversion is:
Mel(f) = 2595·log10(1 + f/700) (3)
4. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step3 is specifically:
step3.1: classifying the audio signals after feature extraction using a TDNN network;
step3.2: judging whether the classification result is the category to be localized, so as to proceed with the subsequent localization operation.
5. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step4 is specifically:
step4.1: creating a sound field grid based on the RectGrid class in the Acoular library, wherein the sound field grid represents the possible range of the sound source position;
step4.2: reading the multichannel audio signals acquired by the microphone array from a designated path, and obtaining the sampling rate of the audio signals;
step4.3: converting the multichannel audio signals acquired by the microphone array into HDF5 (.h5) format;
step4.4: framing the audio signal, windowing each frame, performing a fast Fourier transform on each frame, and finally calculating the power spectrum of each frame, specifically:
P_i(k) = |FFT(x_i)(k)|² / N (4)
in formula (4), x_i is the i-th frame of signal x and N is the frame length;
step4.5: drawing the microphone positions, plotting the position information of the microphone array on a graph;
step4.6: obtaining a sound field environment based on the Environment class in the Acoular library;
step4.7: creating steering vectors for sound source localization using the sound field environment, the microphone array information, and the sound field grid;
step4.8: performing beamforming based on the frequency spectrum data and the steering vectors to obtain a beamforming result;
step4.9: calculating the sound pressure level of the beamforming result and displaying the result as a heat map, specifically:
SPL = 20·log10(P / P_ref) (5)
in formula (5), P is the sound pressure to be evaluated and P_ref is the reference sound pressure;
step4.10: obtaining the sound source position information from the heat map of step4.9.
CN202311134116.XA (priority 2023-09-05, filed 2023-09-05) Sound source positioning method based on deep learning and beam forming, status: Pending, published as CN117169812A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202311134116.XA | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202311134116.XA | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming

Publications (1)

Publication Number | Publication Date
CN117169812A (en) | 2023-12-05

Family

ID=88940642

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202311134116.XA (pending, published as CN117169812A) | Sound source positioning method based on deep learning and beam forming | 2023-09-05 | 2023-09-05

Country Status (1)

Country Link
CN (1) CN117169812A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN117368847A (en) * | 2023-12-07 | 2024-01-09 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network
CN117368847B (en) * | 2023-12-07 | 2024-03-15 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network

Similar Documents

Publication Publication Date Title
CN109597022B (en) Method, device and equipment for calculating azimuth angle of sound source and positioning target audio
US10127922B2 (en) Sound source identification apparatus and sound source identification method
CN107577449B (en) Wake-up voice pickup method, device, equipment and storage medium
JP4248445B2 (en) Microphone array method and system, and voice recognition method and apparatus using the same
CN111445920B (en) Multi-sound source voice signal real-time separation method, device and pickup
CN109427328B (en) Multichannel voice recognition method based on filter network acoustic model
CN111044973A (en) MVDR target sound source directional pickup method for microphone matrix
KR20130084298A (en) Systems, methods, apparatus, and computer-readable media for far-field multi-source tracking and separation
CN106872945B (en) Sound source positioning method and device and electronic equipment
CN112904279B (en) Sound source positioning method based on convolutional neural network and subband SRP-PHAT spatial spectrum
CN108109617A (en) A kind of remote pickup method
CN110875056B (en) Speech transcription device, system, method and electronic device
CN110610718B (en) Method and device for extracting expected sound source voice signal
CN111798869B (en) Sound source positioning method based on double microphone arrays
CN117169812A (en) Sound source positioning method based on deep learning and beam forming
CN112394324A (en) Microphone array-based remote sound source positioning method and system
CN111031463A (en) Microphone array performance evaluation method, device, equipment and medium
CN112859000B (en) Sound source positioning method and device
CN112037813B (en) Voice extraction method for high-power target signal
CN116559778B (en) Vehicle whistle positioning method and system based on deep learning
CN113870893A (en) Multi-channel double-speaker separation method and system
CN114830686A (en) Improved localization of sound sources
CN112363112A (en) Sound source positioning method and device based on linear microphone array
CN114927141B (en) Method and system for detecting abnormal underwater acoustic signals
CN116153324A (en) Virtual array expansion beam forming method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination