CN117169812A - Sound source positioning method based on deep learning and beam forming - Google Patents
Sound source positioning method based on deep learning and beam forming
- Publication number
- CN117169812A (application CN202311134116.XA)
- Authority
- CN
- China
- Prior art keywords
- audio signal
- sound source
- beam forming
- microphone array
- deep learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Landscapes
- Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention relates to a sound source positioning method based on deep learning and beam forming, and belongs to the technical field of sound source positioning. First, four microphones are arranged at fixed intervals, as required, to form a microphone array, and the microphone array is used to acquire audio. The original audio is then converted into a Mel spectrum for feature extraction, a TDNN network classifies the audio, and finally a beam forming method performs sound source localization on the multichannel audio according to the audio classification result. By combining deep learning with beam forming, the invention reduces the required data-labeling cost and allows the target of sound source localization to be changed flexibly.
Description
Technical Field
The invention relates to a sound source positioning method based on deep learning and beam forming, and belongs to the technical field of sound source positioning.
Background
Sound source localization refers to determining the location from which a sound originates. In many applications, such as intelligent audio monitoring, robot navigation, and speech enhancement, accurately locating the sound source is critical to providing a better user experience and performance. Beamforming is a technique that enhances the signal from a particular direction by suitably combining the signals received by an array of sensors. It is commonly applied to microphone arrays: by weighting and phase adjustment, it enhances the sound source signal in the direction of interest while suppressing interference from other directions. Conventional sound source localization methods estimate the direction of the sound source using signal processing techniques and sensor arrays, for example cross-correlation functions, beam differencing, and least squares. However, these methods may suffer from noise, multipath effects, and signal strength variations, which limit their positioning accuracy.
Deep learning is a machine learning technique whose core is to learn representations and features of data by building and training deep neural networks. In the sound source localization problem, deep learning can help to automatically learn complex sound source characteristics and improve the estimation of the sound source position. In recent years, researchers have begun to apply deep learning to sound source localization: with a deep learning algorithm, a more effective representation of sound source characteristics can be learned from a large amount of data, thereby improving the accuracy and robustness of localization. The sound source positioning method based on deep learning and beam forming provides an effective solution to the sound source localization problem by combining the strong feature learning capability of deep learning with the directional enhancement capability of beamforming.
Disclosure of Invention
The invention aims to provide a sound source positioning method based on deep learning and beam forming that addresses the problems that labeling a data set for sound source localization is expensive and that each specific scene requires its own labeled data set, thereby increasing the flexibility of sound source localization applications and reducing cost.
The technical scheme of the invention is as follows: a sound source localization method based on deep learning and beam forming comprises the following specific steps:
step1: and 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used for acquiring an original audio signal.
The Step1 specifically comprises the following steps:
step1.1: four specific positions are selected to place the microphone under a three-dimensional rectangular coordinate system.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively.
The positions of these microphones constitute a (0, 0) -centered microphone array, where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel raw audio signal is acquired using a microphone array.
Step1.3: noise and interference in the environment are removed by using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the original audio signal after noise reduction is amplified, and the energy of the original audio signal is enhanced, so that the signal is more obvious and easy to distinguish.
Step2: and converting the acquired original audio signal into a Mel frequency spectrum for feature extraction.
The Step2 specifically comprises the following steps:
step2.1: the acquired original audio signal is loaded and resampled to ensure its sampling rate at 16000Hz.
Step2.2: and carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard.
Step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of the window.
Step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast fourier transformation and overlapping windowing, the frequency spectrum of the audio signal is obtained.
Step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
step3: and classifying the extracted audio signals by using a TDNN network.
The Step3 specifically comprises the following steps:
step3.1: and classifying the audio signals after the features are extracted by using a TDNN network.
Step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
Step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
The Step4 specifically comprises the following steps:
step4.1: a sound field grid is created based on the RectGrid function in the Aculolar library to represent the possible range of sound source locations.
Step4.2: and reading the multichannel audio signals acquired by the microphone array from the designated path, and obtaining information such as the sampling rate of the audio signals.
Step4.3: and converting the multichannel audio signals acquired by the microphone array into an h5 format.
Step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the i-th frame of signal x.
Step4.5: microphone positions are plotted and position information of the microphone array is plotted on a graph.
Step4.6: and obtaining a sound field Environment based on the Environment function in the aclar library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref And the reference sound pressure is used as a reference.
Step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
The beneficial effects of the invention are as follows: the invention acquires sound source information with a microphone array, uses a TDNN network to identify whether the sound is the target to be localized, and then gives the position information of the sound source. Compared with the prior art, it mainly addresses the problems that labeling a data set for sound source localization is expensive and that each specific scene requires its own labeled data set, thereby increasing the flexibility of sound source localization applications and reducing cost.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a microphone array position diagram of the present invention;
fig. 3 is a position information diagram of the sound source localization of the present invention.
Detailed Description
The invention will be further described with reference to the drawings and detailed description.
Example 1: as shown in fig. 1, a sound source localization method based on deep learning and beam forming is characterized in that:
step1: and 4 microphones are formed into a microphone array at fixed intervals according to requirements, and the microphone array is used for acquiring an original audio signal.
Step2: and converting the acquired original audio signal into a Mel frequency spectrum for feature extraction.
Step3: and classifying the extracted audio signals by using a TDNN network.
Step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
In practical applications the required sound source localization scenes differ, so a suitable microphone array must be configured according to the specific requirements; the microphone positions are shown in fig. 2. In this embodiment, Step1 is specifically:
step1.1: four specific positions are selected to place the microphone under a three-dimensional rectangular coordinate system.
The specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively.
The positions of these microphones constitute a (0, 0) -centered microphone array, where the parameter a is regarded as a fixed value.
Step1.2: a 4-channel raw audio signal is acquired using a microphone array.
Step1.3: noise and interference in the environment are removed by using an adaptive filtering algorithm, and a clean audio signal is extracted from the original audio signal.
Step1.4: the original audio signal after noise reduction is amplified, and the energy of the original audio signal is enhanced, so that the signal is more obvious and easy to distinguish.
Because the original audio signal is a time-domain waveform containing a great deal of detail and noise, it is not well suited to being fed directly into a neural network. By converting the audio signal to the frequency domain, the Mel spectrum represents the speech features in audio better: it decomposes the audio into the energy at a series of specific frequencies and corresponds more closely to human auditory perception. Therefore, in this embodiment, Step2 is specifically:
step2.1: the acquired original audio signal is loaded and resampled to ensure its sampling rate at 16000Hz.
Step2.2: and carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard.
Step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of the window.
Step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast fourier transformation and overlapping windowing, the frequency spectrum of the audio signal is obtained.
Step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
since the audio signal is a time domain signal, it has temporal sequence. The TDNN network takes time delay into consideration in design, and features in different time ranges are captured through convolution kernels of different time steps, so that timing information can be effectively processed. This enables the TDNN network to better understand and utilize contextual information in audio, which is very useful for audio classification tasks. Therefore, in this embodiment, step3 is specifically:
step3.1: and classifying the audio signals after the features are extracted by using a TDNN network.
Step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
The Step4 specifically comprises the following steps:
step4.1: a sound field grid is created based on the RectGrid function in the Aculolar library to represent the possible range of sound source locations.
Step4.2: and reading the multichannel audio signals acquired by the microphone array from the designated path, and obtaining information such as the sampling rate of the audio signals.
Step4.3: and converting the multichannel audio signals acquired by the microphone array into an h5 format.
Step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the i-th frame of signal x.
Step4.5: microphone positions are plotted and position information of the microphone array is plotted on a graph.
Step4.6: and obtaining a sound field Environment based on the Environment function in the aclar library.
Step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref And the reference sound pressure is used as a reference.
Step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
While the present invention has been described in detail with reference to the drawings, it is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.
Step2.5: after fast Fourier transformation and overlapping windowing treatment, obtaining the frequency spectrum of the audio signal;
step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
4. the sound source localization method based on deep learning and beam forming according to claim 1, wherein Step3 is specifically:
step3.1: classifying the audio signals after the characteristics are extracted by using a TDNN network;
step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
5. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step4 is specifically:
step4.1: creating a sound field grid based on a RectGrid function in an acicular library, wherein the sound field grid is used for representing a possible range of a sound source position;
step4.2: reading multichannel audio signals acquired by the microphone array from a designated path, and obtaining the sampling rate of the audio signals;
step4.3: converting the multichannel audio signals acquired by the microphone array into an h5 format;
step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the ith frame of signal x;
step4.5: drawing microphone positions, and drawing position information of a microphone array on a graph;
step4.6: obtaining a sound field Environment based on an Environment function in an aclar library;
step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref A reference sound pressure is used as a reference;
step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
Claims (5)
1. A sound source positioning method based on deep learning and beam forming is characterized in that:
step1: forming 4 microphones into a microphone array at fixed intervals according to requirements, and acquiring an original audio signal by using the microphone array;
step2: converting the acquired original audio signal into a Mel frequency spectrum for feature extraction;
step3: classifying the extracted audio signals by using a TDNN network;
step4: and further utilizing a beam forming method to carry out sound localization on the multichannel audio signals according to the audio signal classification result.
2. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step1 is specifically:
step1.1: selecting four specific positions to place microphones in a three-dimensional rectangular coordinate system;
the specific positions are (a, a, 0), (-a, a, 0), (-a, -a, 0), (a, -a, 0), respectively;
the positions of these microphones form a microphone array centered at the origin (0, 0, 0), where the parameter a is regarded as a fixed value;
step1.2: collecting 4-channel original audio signals by using a microphone array;
step1.3: removing noise and interference in the environment by using an adaptive filtering algorithm, and extracting a pure audio signal from an original audio signal;
step1.4: amplifying the original audio signal after noise reduction, and enhancing the energy of the original audio signal.
3. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step2 is specifically:
step2.1: loading the obtained original audio signal and resampling the audio signal to ensure that the sampling rate is 16000Hz;
step2.2: carrying out decibel normalization processing on the resampled audio signal to ensure that the volume range is under the consistent standard;
step2.3: carrying out a overlapping window technology, and carrying out framing treatment on an audio signal by utilizing a Hamming window windowing function so that certain overlapping exists between adjacent frames, wherein the method specifically comprises the following steps:
in the formula (1), N is more than or equal to 0 and less than or equal to N-1, wherein N is the length of a window;
step2.4: the input audio signal is subjected to a fast fourier transform, mapping it from the time domain to the frequency domain:
step2.5: after fast Fourier transformation and overlapping windowing treatment, obtaining the frequency spectrum of the audio signal;
step2.6: converting the frequency spectrum into a Mel scale to obtain a Mel spectrum, wherein the formula for converting the frequency into the Mel scale is as follows:
4. the sound source localization method based on deep learning and beam forming according to claim 1, wherein Step3 is specifically:
step3.1: classifying the audio signals after the characteristics are extracted by using a TDNN network;
step3.2: judging whether the classification result is the category needing to be positioned so as to carry out the next positioning operation.
5. The sound source localization method based on deep learning and beam forming according to claim 1, wherein Step4 is specifically:
step4.1: creating a sound field grid based on a RectGrid function in an acicular library, wherein the sound field grid is used for representing a possible range of a sound source position;
step4.2: reading multichannel audio signals acquired by the microphone array from a designated path, and obtaining the sampling rate of the audio signals;
step4.3: converting the multichannel audio signals acquired by the microphone array into an h5 format;
step4.4: framing the audio signal, windowing each frame, performing fast Fourier transform on each frame, and finally calculating the power spectrum of each frame specifically as follows:
in the formula (4), x i Is the ith frame of signal x;
step4.5: drawing microphone positions, and drawing position information of a microphone array on a graph;
step4.6: obtaining a sound field Environment based on an Environment function in an aclar library;
step4.7: steering vectors for sound source localization are created using the sound field environment, microphone array information, and the sound field grid.
Step4.8: and realizing beam forming based on the frequency spectrum data and the steering vector to obtain a beam forming result.
Step4.9: the sound pressure level of the beam forming result is calculated and the result is displayed on a thermodynamic diagram, specifically:
SPL=20*log 10 (P/Pref) (5)
in the formula (5), P is the sound pressure to be calculated, P ref A reference sound pressure is used as a reference;
step4.10: sound source position information is obtained from the thermodynamic diagram in step4.9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311134116.XA CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311134116.XA CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117169812A true CN117169812A (en) | 2023-12-05 |
Family
ID=88940642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311134116.XA Pending CN117169812A (en) | 2023-09-05 | 2023-09-05 | Sound source positioning method based on deep learning and beam forming |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117169812A (en) |
Worldwide Applications (1)
- 2023-09-05 CN CN202311134116.XA patent/CN117169812A/en active Pending

Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117368847A * | 2023-12-07 | 2024-01-09 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network
CN117368847B * | 2023-12-07 | 2024-03-15 | 深圳市好兄弟电子有限公司 | Positioning method and system based on microphone radio frequency communication network
Legal Events
Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |