CN110838303A - Voice sound source positioning method using microphone array - Google Patents

Voice sound source positioning method using microphone array

Info

Publication number
CN110838303A
CN110838303A
Authority
CN
China
Prior art keywords
time
voice
signal
frequency
sound
Prior art date
Legal status
Granted
Application number
CN201911069273.0A
Other languages
Chinese (zh)
Other versions
CN110838303B (en)
Inventor
王浩 (Wang Hao)
卢晶 (Lu Jing)
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201911069273.0A priority Critical patent/CN110838303B/en
Publication of CN110838303A publication Critical patent/CN110838303A/en
Application granted granted Critical
Publication of CN110838303B publication Critical patent/CN110838303B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 - Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 - Position-fixing by co-ordinating two or more direction or position line determinations using ultrasonic, sonic, or infrasonic waves
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Abstract

The invention discloses a method for localizing a speech sound source with a microphone array, comprising the following steps: (1) generating training samples, obtaining the time-frequency domain signals and their power envelopes; (2) judging, for each time-frequency point of the time-frequency domain signal, whether it belongs to the direct speech signal; (3) training a neural network with a UNET structure using the samples generated in step (1); (4) using the trained UNET network to predict the time-frequency points corresponding to the direct speech sound of a noisy signal under test; (5) applying a localization method to the time-frequency points judged to be direct speech sound to obtain the localization result. The method can effectively remove the influence of interference and reverberation in environments with high reverberation and strong interference and obtain results with high accuracy and robustness.

Description

Voice sound source positioning method using microphone array
Technical Field
The invention relates to a method for localizing a speech sound source with a microphone array in high-interference, high-reverberation environments, based on a UNET structure, and belongs to the technical field of speech signal processing.
Background
The purpose of speech sound source localization (SSL) is to estimate the direction of arrival (DOA) of a speech signal at the microphone array. Sound source localization, or DOA estimation, of speech signals with a microphone array is an important and active topic in acoustic signal processing. It plays a key role in sound capture in many application scenarios, such as human-machine voice interaction, camera tracking and intelligent monitoring on smart devices. The difficulty is that the speech signal is a broadband, non-stationary random process, while background noise, reverberation and other interfering sound sources are also present.
Classical sound source localization methods can be divided into TDOA (Time Delay of Arrival), SRP (Steered Response Power) and spatial-spectrum methods; data-driven methods mainly use convolutional neural networks to obtain the DOA directly. In many practical scenarios there is not only reverberation but also noise interference, and most existing methods cannot maintain high accuracy and robustness in such complex environments.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a speech source localization method using a microphone array that still obtains accurate and robust results in environments with high reverberation and strong interference.
In order to achieve the purpose, the invention adopts the technical scheme that:
a method for locating a voice sound source using a microphone array, comprising the steps of:
step 1, collecting speech signals and interference signals with a microphone array, obtaining the time-frequency domain signals of the noisy speech signal and the clean speech signal, and calculating the logarithmic power spectral magnitudes of both; the clean speech signal consists of the direct speech sound only;
step 2, respectively calculating the spatial power response spectra of all time-frequency points in the time-frequency domain of the noisy speech signal and the clean speech signal, further estimating the time delay corresponding to each time-frequency point, and recording τ̂_X(n,k) and τ̂_S(n,k) as the time-frequency window delay estimates of the noisy speech signal and the clean speech signal, respectively, corresponding to time n and frequency band k; obtaining the time-frequency point distribution diagram corresponding to the direct speech sound;
step 3, training a neural network with a UNET structure using the logarithmic power spectral magnitudes of the noisy and clean speech signals from step 1 and the time-frequency point distribution diagram of the direct speech sound from step 2; estimating the time-frequency point distribution diagram of the direct speech sound of the signal under test using its logarithmic power spectral magnitude and the trained network;
step 4, obtaining the speech source localization result by using the direct-sound distribution estimated in step 3 as weights in combination with a weighted localization algorithm.
Further, in step 2, the time-frequency distribution points corresponding to the direct sound are selected to satisfy the following conditions simultaneously:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses of the noisy speech signal and the clean speech signal at the same position is greater than a threshold TH_2.
Further, in step 3, the input of the neural network is the logarithmic power spectrogram of the noisy speech signal, and the outputs are the logarithmic power spectrogram of the clean speech signal and the time-frequency point distribution diagram of the direct speech sound; the clean-speech spectrogram is used to assist training, and the values of the distribution diagram serve as the time-frequency point weights used in step 4.
In this localization method, the direct speech component is localized, so that interference and reverberation components contribute as little as possible to the localization; accurate and robust results are thus still obtained in environments with high reverberation and strong interference, and the influence of interfering noise on the localization performance is effectively avoided. The UNET-structured neural network first down-samples, learning deep features of the input data through successive convolutions, and then up-samples through deconvolution to fit the features of the input and output data. This network structure has the strong feature-learning capability of a deep neural network and is well suited to learning speech features and judging the direct sound. The UNET model adopted in the invention can be used with different array shapes; since the network predicts single-channel signals, it does not need to be retrained for arrays of different shapes in actual use.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a basic block diagram of the UNET network in an embodiment of the present invention;
FIG. 3 is a logarithmic power spectrum of a noisy speech signal;
FIG. 4 is a log power spectrum of a clean speech signal;
FIG. 5 is a graph of theoretical direct speech sound time-frequency distribution;
FIG. 6 is a distribution diagram of UNET predicted direct speech sound time-frequency points;
FIG. 7 shows the spatial power responses of the unweighted noisy signal, the noisy signal weighted by the theoretical direct-sound distribution, and the noisy signal weighted by the predicted direct-sound distribution (the left peak corresponds to the speech signal and the right peak to the interference signal; each curve is normalized to its maximum).
Detailed Description
The present invention is further illustrated below in conjunction with the drawings and a specific embodiment. These examples are given solely for illustration and are not intended to limit the scope of the invention; various equivalent modifications occurring to those skilled in the art upon reading the present disclosure fall within the scope of the appended claims.
This embodiment is carried out in simulation and provides a UNET-based method for localizing a speech sound source with a microphone array, suitable for high-interference, high-reverberation environments and for arrays of different shapes, comprising the following steps:
1. generating training samples to obtain time-frequency domain signals and obtaining power envelopes.
A speech or interference sound source is placed in a simulated room and signals are collected with I microphones. The speech and interference signals are collected separately at different positions and superimposed in the time domain to form the noisy speech signal; the signal amplitude is normalized by its maximum value. The time-frequency domain signals are obtained by the short-time Fourier transform (STFT) and denoted x_i(n,k), the noisy speech signal of the i-th microphone in frame n and frequency band k; the speech time-frequency signal received by the array before superposition is denoted s_i(n,k). The mean power spectral magnitudes of the noisy speech signal, X(n,k), and of the clean speech signal, S(n,k), are respectively

X(n,k) = (1/I) Σ_{i=1}^{I} |x_i(n,k)|²   (1)

S(n,k) = (1/I) Σ_{i=1}^{I} |s_i(n,k)|²   (2)

where x_i(n,k) and s_i(n,k) are the single-channel signals of the noisy and clean speech received by the microphone array. Their logarithmized versions are

X_L(n,k) = log₁₀(X(n,k) + ξ)   (3)

S_L(n,k) = log₁₀(S(n,k) + ξ)   (4)

where X_L(n,k) and S_L(n,k) are the logarithmic power spectral magnitudes of the noisy and clean speech signals, respectively, and ξ is a background-noise power estimate used to reduce the influence of the noise floor on the robustness of the method.
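The envelope computation of equations (1)-(4) can be sketched in a few lines of NumPy; the function name and the default value of ξ are illustrative, not specified by the patent:

```python
import numpy as np

def log_power_envelope(stft, xi=1e-4):
    """Eqs. (1)-(4): average the power-spectral magnitude over the I
    channels, then logarithmize with a noise-floor constant xi.

    stft: complex array of shape (I, N, K) = (channels, frames, bands).
    """
    # Eq. (1)/(2): mean of |x_i(n,k)|^2 over the I microphones
    mean_power = np.mean(np.abs(stft) ** 2, axis=0)
    # Eq. (3)/(4): log10 with the background-noise estimate xi added
    return np.log10(mean_power + xi)
```

Because the envelope is averaged over channels before the logarithm, the result is a single-channel (N, K) map, which is what lets the UNET remain array-agnostic.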
2. Judging the direct speech signal.
In real scenarios there are always environmental interferences and room reverberation, which degrade speech localization. Identifying the direct speech sound improves the accuracy and robustness of source localization and effectively removes the influence of interference and reverberation.
For each time-frequency point in the time-frequency domain of the noisy and clean speech signals, the spatial power response spectra P_X(τ|n,k) and P_S(τ|n,k) are computed with the steered response power (SRP) algorithm:

P_X(τ|n,k) = |g^H(k,τ) x(n,k)|²   (5)

P_S(τ|n,k) = |g^H(k,τ) s(n,k)|²   (6)

where x(n,k) and s(n,k) are the multi-channel frequency-domain noisy and clean speech signals corresponding to time n and frequency band k, x(n,k) = [x_1(n,k), x_2(n,k), ..., x_I(n,k)]^T and s(n,k) = [s_1(n,k), s_2(n,k), ..., s_I(n,k)]^T, the superscript H denotes the conjugate transpose, T denotes the transpose, and g(k,τ) is the steering vector corresponding to delay τ in frequency band k. After obtaining the spatial power response, the time delay (TDoA) corresponding to the point is further estimated as

τ̂_X(n,k) = argmax_τ P_X(τ|n,k)   (7)

τ̂_S(n,k) = argmax_τ P_S(τ|n,k)   (8)

where τ̂_X(n,k) and τ̂_S(n,k) are the time-frequency window delay estimates of the noisy and clean speech signals corresponding to time n and frequency band k, and argmax returns the value of the argument that maximizes the expression.
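A per-bin version of equations (5)-(8) for a uniform line array can be sketched as follows; the inter-element delay parameterization of the steering vector and the search grid are assumptions made for illustration:

```python
import numpy as np

def srp_delay_estimate(x_nk, omega_k, tau_grid):
    """Eqs. (5)-(8) at one time-frequency point of a uniform line array.

    x_nk: complex snapshot of shape (I,); omega_k: angular frequency of
    band k; tau_grid: candidate inter-element delays (s).  Returns the
    steered-response power over the grid and tau_hat = argmax_tau P.
    """
    elem = np.arange(x_nk.shape[0])
    powers = np.array([
        # P(tau|n,k) = |g(k,tau)^H x(n,k)|^2 with g_i = exp(-j w_k i tau)
        np.abs(np.vdot(np.exp(-1j * omega_k * elem * tau), x_nk)) ** 2
        for tau in tau_grid
    ])
    return powers, tau_grid[np.argmax(powers)]
```

For a plane wave with true inter-element delay τ₀, the grid maximum falls at τ₀ (up to grid resolution), which is exactly the per-bin TDoA estimate that the labeling step compares against (d sin θ)/c.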
The time-frequency distribution points corresponding to the direct sound are extracted as follows:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) of the corresponding time-frequency window differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) of the corresponding time-frequency window differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses P_X(τ|n,k) and P_S(τ|n,k) of the two signals at the same position is greater than a threshold TH_2.
The center point of a time-frequency window that satisfies all three conditions is marked 1, otherwise 0, yielding the time-frequency point distribution diagram of the direct speech sound.
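The three-condition labeling above amounts to a logical AND of three threshold tests per time-frequency point; a minimal sketch, with array shapes and names chosen for illustration:

```python
import numpy as np

def direct_path_mask(tau_hat_X, tau_hat_S, corr_PX_PS, tau_true, th1, th2=0.98):
    """Mark a time-frequency point 1 iff all three step-2 conditions hold:
    1) |tau_hat_X - tau| < TH1 in the noisy signal,
    2) |tau_hat_S - tau| < TH1 in the clean signal,
    3) corr(P_X, P_S) > TH2 at the same position.
    tau_hat_X, tau_hat_S, corr_PX_PS: (N, K) arrays; tau_true = d*sin(theta)/c.
    """
    return ((np.abs(tau_hat_X - tau_true) < th1)
            & (np.abs(tau_hat_S - tau_true) < th1)
            & (corr_PX_PS > th2)).astype(np.float32)
```

The resulting 0/1 map is the training target D of the UNET and, at test time, its prediction supplies the weights W(n,k) for the localization step.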
3. Training the UNET structure.
The basic block diagram of the UNET used in this embodiment is shown in Fig. 2. CNN(K) and DeCNN(K) denote a convolutional layer (Convolutional Neural Network Layer) and a deconvolutional (transposed-convolution) layer with K channels, respectively, and all activation functions are Leaky ReLU (LReLU). The deconvolutional layers double the length and width of the feature maps (decoding/up-sampling), while Max Pooling halves them (encoding/down-sampling); in the invention each expansion or reduction is by a factor of two. The Input of the UNET structure is the logarithmic power spectrogram X_L(n,k) of the noisy speech signal; the two outputs, Speech (S) and DPD (D), are the logarithmic power spectrogram S_L(n,k) of the direct speech sound and the time-frequency point distribution diagram of the direct speech sound obtained in step 2, respectively. Both the Input and Speech spectrograms are derived from data acquired by a single microphone, regardless of the array structure, so the model can be applied to different types of arrays.
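The doubling and halving of feature-map size mentioned above can be illustrated by the pooling and up-sampling scaffold alone; the convolutional layers, skip connections and LReLU activations of the actual UNET are omitted in this sketch:

```python
import numpy as np

def max_pool2(x):
    """Encoder step: 2x2 max pooling halves both spatial dimensions."""
    n, k = x.shape
    return x[:n // 2 * 2, :k // 2 * 2].reshape(n // 2, 2, k // 2, 2).max(axis=(1, 3))

def upsample2(x):
    """Decoder step: nearest-neighbour up-sampling doubles both dimensions
    (a transposed convolution would additionally learn a filter)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
```

Stacking several pooling stages, then the same number of up-sampling stages, reproduces the U-shaped resolution profile that gives the architecture its name.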
The UNET neural network cost function is

min (1-λ)||S* - S||² + λ||D* - D||²   (9)

where S* and D* are the predictions of S and D output by the neural network, ||·||² denotes the squared ℓ2 norm, and λ is 0 at the beginning of training and gradually increases to 1 as training progresses.
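The cost of equation (9) and its λ schedule can be written directly; the linear ramp below is an assumption, since the patent only states that λ grows from 0 to 1 during training:

```python
import numpy as np

def unet_cost(S_pred, S_true, D_pred, D_true, lam):
    """Eq. (9): (1 - lambda)*||S* - S||^2 + lambda*||D* - D||^2.
    Early in training (lam ~ 0) the net fits the clean spectrogram;
    later (lam -> 1) it fits the direct-path distribution map."""
    return ((1.0 - lam) * np.sum((S_pred - S_true) ** 2)
            + lam * np.sum((D_pred - D_true) ** 2))

def lam_schedule(step, total_steps):
    """Hypothetical linear ramp of lambda from 0 to 1."""
    return min(1.0, step / total_steps)
```

Starting from the spectrogram term acts as a curriculum: the network first learns a speech representation, then specializes to the harder direct-path classification.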
4. Predicting the direct speech sound of the noisy signal under test with the UNET structure.
When the trained UNET is used, only the logarithmic power spectrogram X_L(n,k) of the noisy speech signal is fed to the Input; the Output then gives the time-frequency point distribution diagram of the direct speech sound, whose values serve as the weights of the time-frequency points used below, denoted W(n,k).
5. Obtaining the localization result with the weighted steered response power (WSRP) algorithm.
Any common localization method can be used here, e.g. the SRP method applied to the selected time-frequency points. Because the time-frequency points must be weighted, this embodiment adopts the WSRP method, and the final localization result is expressed as
θ̂ = argmax_θ Σ_{n,k} W(n,k) |g^H(k,θ) x(n,k)|²   (10)

where g(k,θ) is the steering vector of frequency band k for direction θ, θ is a candidate value of the direction of arrival (the independent variable), and θ̂ is the direction of arrival to be estimated. The microphone array may be any suitable array; typically a line array or a circular array is used. If a uniform line array is used, g(k,θ) is expressed as

g(k,θ) = exp(-j ω_k d sin θ / c)   (11)

where exp denotes the exponential with base e, j is the imaginary unit, c is the speed of sound, d is the distance vector of the microphone array elements, and ω_k is the angular frequency corresponding to frequency band k.
At this point, a voice sound source localization result is obtained.
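Equations (10)-(11) — the weighted steered-response power scanned over a grid of candidate directions — can be sketched for a uniform line array as follows; the grid resolution and variable names are illustrative:

```python
import numpy as np

def wsrp_localize(X, W, freqs, d, c=344.0, theta_deg=np.arange(-90, 91)):
    """Eqs. (10)-(11): theta_hat = argmax_theta sum_{n,k} W(n,k) *
    |g(k,theta)^H x(n,k)|^2 for a uniform line array.

    X: complex STFT, shape (I, N, K); W: UNET weights, shape (N, K);
    freqs: band centre frequencies (K,); d: element spacing (m).
    """
    I = X.shape[0]
    omega = 2.0 * np.pi * freqs               # angular frequencies (K,)
    elem = np.arange(I)[:, None]              # element indices (I, 1)
    thetas = np.deg2rad(theta_deg)
    scores = np.empty(thetas.shape)
    for t, theta in enumerate(thetas):
        # Eq. (11): g_i(k,theta) = exp(-j w_k i d sin(theta) / c)
        g = np.exp(-1j * omega[None, :] * elem * d * np.sin(theta) / c)
        # |g^H x|^2 per TF point, then weighted sum over all points
        beam = np.abs(np.einsum('ik,ink->nk', g.conj(), X)) ** 2
        scores[t] = np.sum(W * beam)
    return thetas[np.argmax(scores)]
```

Setting W(n,k) = 1 everywhere recovers plain SRP; the UNET weights suppress the time-frequency points dominated by reverberation and interference before the scan.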
An example of a simulation is given below.
1. Simulated mixed-speech generation
This implementation takes localization of simulated signals as an example. Room impulse responses are generated with the image-source model and convolved with clean speech to produce speech in a reverberant environment; impulse responses generated at different source positions with the same room parameters are convolved with clean speech and superimposed to obtain the mixed signal. In the simulation, the microphone array is a 4-channel line array; the element spacing is 2 cm during network training and 3.5 cm during prediction, and the room size is drawn randomly around 7 × 5 × 3 m³. The target sound source is placed at 60°, 45° and 30° on the left side of the array at 2 m from the array center, and the interference source at 45° on the right side. The room reverberation time is chosen randomly between 0.2 s and 0.9 s, and the signal-to-interference ratio randomly between -5 dB and 10 dB. Each speech sample is 1.2 s long, and the sampling frequency is 16 kHz. The speech and interference signals are collected separately at different positions and superimposed in the time domain to form the noisy speech signal. When collecting the direct speech signal, the reflection coefficients of all room walls are set to 0. Since single-channel signals are used in training, the array shape and source position have a negligible effect on network training.
2. Method process flow
a) Parameter setting
The parameters of the process of the invention are first given in table 1. It should be noted that the method of the present invention does not require adjustment of parameters in different environments, and the given parameters can be applied in various environments.
TABLE 1 Parameter settings

Parameter              Value
Window width           512
Frame shift            256
ξ                      1 × 10⁻⁴
c                      344 m/s
TH_1                   d/(15c)
TH_2                   0.98
Frequency band range   [2000 Hz, 8000 Hz]
b) Short time Fourier transform
A discrete short-time Fourier transform is applied to the time-domain signals acquired by the microphones to obtain the time-frequency domain signals; the window function is a Hanning window with a window length of 32 ms and a window shift of 16 ms.
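The framing of step b) — a Hanning window of 512 samples with a 256-sample shift at 16 kHz — can be sketched as follows (the function name is illustrative):

```python
import numpy as np

def stft_frames(x, win_len=512, hop=256):
    """Discrete STFT as in step b): Hanning window of 32 ms (512 samples
    at 16 kHz) with a 16 ms shift (256 samples)."""
    win = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[n * hop : n * hop + win_len] * win
                       for n in range(n_frames)])
    # real-input FFT: win_len//2 + 1 non-negative frequency bins
    return np.fft.rfft(frames, axis=1)
```

For a 1.2 s sample at 16 kHz this yields about 73 frames of 257 frequency bins each, of which only the [2000 Hz, 8000 Hz] bands of Table 1 are used.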
c) Computing the "energy" envelope
For each time-frequency point of the time-frequency domain signal, the logarithmic power spectral magnitude is computed with equations (1)-(4).
d) Selecting the time-frequency points corresponding to the direct speech sound
For each time-frequency point of the time-frequency domain signal, the spatial power response and time delay are computed with equations (5)-(8), and whether the point is direct sound is judged by the three conditions of step 2.
e) Training the designed UNET structure with the generated samples
For the designed UNET structure:
the Input is the logarithmic power spectrogram X_L(n,k) of the noisy speech signal; the two outputs Speech (S) and DPD (D) are the logarithmic power spectrogram S_L(n,k) of the direct speech sound and the time-frequency point distribution diagram of the direct speech sound obtained in step d), respectively;
the UNET neural network cost function is given by equation (9).
f) Predicting the direct sound
For the trained UNET structure: the logarithmic power spectrogram of the noisy speech signal is fed to the Input, and the time-frequency point distribution diagram of the direct speech sound is obtained at the Output.
g) Applying the weighted steered response power method to the selected time-frequency points to obtain the localization result
The final localization result is estimated over the weighted time-frequency points using equation (10).
To illustrate the advantages of the method, it is compared with the common traditional algorithm SRP-PHAT in simulations and experiments.
Under simulation conditions, 60 sets of data were tested in each direction using a 4-channel line array, with the same conditions as in the simulation example. Figs. 3-7 show that after the noisy speech signal is processed by the method of the invention, the energy of the interference signal (right peak) in the spatial power response is greatly reduced, and its influence on localization is much weaker.
A positioning result differing from the true angle by less than 5 ° is defined here as a valid positioning. Table 2 shows the effective positioning rate of the method and the conventional algorithm SRP-PHAT in the test set, and the effectiveness of the positioning effect can be obviously seen.
TABLE 2 Comparison of valid localization rates

Angle (°)   Method of the invention   SRP-PHAT
-30         68.33%                    23.33%
-45         75%                       18.33%
-60         55%                       15%
In the experiment, tests were performed in two rooms: Room 1, a small room with high reverberation, volume 5.2 × 3.5 × 3 m³, T60 = 1.10 s; Room 2, an audio-visual room, volume 7.3 × 5.3 × 3 m³, T60 = 0.36 s. Fifty speech samples were recorded with a 4-channel line array with 3.5 cm spacing, while interference samples containing 20 different common noises were played back cyclically in the recording environment; the speech source and the interference source were both 2 m from the microphone array and at the same height. The sampling rate is 16 kHz. The speech source is at -30°, -45° and -60°, respectively, and the interference source at 45°. The signal-to-interference ratio was held at about 3 dB to match practical conditions.
TABLE 3 Comparison of RMSE (°) of the different methods in the experiment
[Table 3 is reproduced as an image in the original publication; its values are not recoverable here.]
Simulation and experiments show that the method provided by the invention is superior to the SRP-PHAT method in accuracy and robustness, the method is more stable under the condition of high reverberation, and the maximum RMSE in the experiment is 3.69 degrees and is far lower than that of the traditional SRP-PHAT algorithm.

Claims (3)

1. A method for locating a voice sound source using a microphone array, comprising the steps of:
step 1, collecting speech signals and interference signals with a microphone array, obtaining the time-frequency domain signals of the noisy speech signal and the clean speech signal, and calculating the logarithmic power spectral magnitudes of both; the clean speech signal consists of the direct speech sound only;
step 2, respectively calculating the spatial power response spectra of all time-frequency points in the time-frequency domain of the noisy speech signal and the clean speech signal, further estimating the time delay corresponding to each time-frequency point, and recording τ̂_X(n,k) and τ̂_S(n,k) as the time-frequency window delay estimates of the noisy speech signal and the clean speech signal, respectively, corresponding to time n and frequency band k; obtaining the time-frequency point distribution diagram corresponding to the direct speech sound;
step 3, training a neural network with a UNET structure using the logarithmic power spectral magnitudes of the noisy and clean speech signals from step 1 and the time-frequency point distribution diagram of the direct speech sound from step 2; estimating the time-frequency point distribution diagram of the direct speech sound of the signal under test using its logarithmic power spectral magnitude and the trained network;
step 4, obtaining the speech source localization result by using the direct-sound distribution estimated in step 3 as weights in combination with a weighted localization algorithm.
2. The method as claimed in claim 1, wherein the selecting of the time-frequency distribution points corresponding to the direct sound in step 2 satisfies the following conditions:
1) in the noisy speech signal, the delay estimate τ̂_X(n,k) differs from the true delay τ = (d sin θ)/c by less than a threshold TH_1, where d, c and θ are the microphone spacing, the speed of sound and the angle at which the speech source reaches the array, respectively;
2) in the clean speech signal, the delay estimate τ̂_S(n,k) differs from the true delay τ by less than the threshold TH_1;
3) the correlation between the spatial power responses of the noisy speech signal and the clean speech signal at the same position is greater than a threshold TH_2.
3. The method as claimed in claim 1, wherein in step 3 the input of the neural network is the logarithmic power spectrogram of the noisy speech signal and the outputs are the logarithmic power spectrogram of the clean speech signal and the time-frequency point distribution diagram of the direct speech sound; the clean-speech spectrogram is used to assist training, and the values of the distribution diagram serve as the time-frequency point weights in step 4.
CN201911069273.0A 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array Active CN110838303B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911069273.0A CN110838303B (en) 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array

Publications (2)

Publication Number Publication Date
CN110838303A true CN110838303A (en) 2020-02-25
CN110838303B CN110838303B (en) 2022-02-08

Family

ID=69576300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911069273.0A Active CN110838303B (en) 2019-11-05 2019-11-05 Voice sound source positioning method using microphone array

Country Status (1)

Country Link
CN (1) CN110838303B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111312273A (en) * 2020-05-11 2020-06-19 腾讯科技(深圳)有限公司 Reverberation elimination method, apparatus, computer device and storage medium
CN112269158A (en) * 2020-10-14 2021-01-26 南京南大电子智慧型服务机器人研究院有限公司 Method for positioning voice source by utilizing microphone array based on UNET structure

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184730A (en) * 2011-02-17 2011-09-14 南京大学 Feed-forward active noise barrier
US20180018970A1 (en) * 2016-07-15 2018-01-18 Google Inc. Neural network for recognition of signals in multiple sensory domains
CN107703486A (en) * 2017-08-23 2018-02-16 南京邮电大学 A kind of auditory localization algorithm based on convolutional neural networks CNN
RU2659100C1 (en) * 2017-06-05 2018-06-28 Федеральное Государственное Казенное Военное Образовательное Учреждение Высшего Образования "Тихоокеанское Высшее Военно-Морское Училище Имени С.О. Макарова" Министерства Обороны Российской Федерации (Г. Владивосток) Large-scale radio-hydro acoustic system formation and application method for monitoring, recognizing and classifying the fields generated by the sources in marine environment
US20180341838A1 (en) * 2017-05-23 2018-11-29 Viktor Prokopenya Increasing network transmission capacity and data resolution quality and computer systems and computer-implemented methods for implementing thereof
CN109410273A (en) * 2017-08-15 2019-03-01 西门子保健有限责任公司 According to the locating plate prediction of surface data in medical imaging
US20190104357A1 (en) * 2017-09-29 2019-04-04 Apple Inc. Machine learning based sound field analysis
CN109754812A (en) * 2019-01-30 2019-05-14 South China University of Technology Voiceprint authentication method with anti-recording attack detection based on convolutional neural networks
CN109839612A (en) * 2018-08-31 2019-06-04 Elevoc Technology Co., Ltd. (Shenzhen) Sound source direction estimation method based on time-frequency masking and deep neural networks
CN110068795A (en) * 2019-03-31 2019-07-30 Tianjin University Indoor microphone-array sound source localization method based on convolutional neural networks
CN110333494A (en) * 2019-04-10 2019-10-15 Ma Peifeng InSAR time-series deformation prediction method, system and related apparatus


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Yongliang Sun et al.: "Human Localization Using Multi-Source Heterogeneous Data in Indoor Environments", IEEE Access *
Song Jianguo et al.: "Improved cascade-correlation neural network algorithm and its application in first-break picking", Oil Geophysical Prospecting *
Wang Hao et al.: "Robust speech source localization based on UNET direct-sound detection", Proceedings of the 2019 National Conference on Acoustics *
Xie Qing et al.: "Identification of direct ultrasonic waves from partial discharge in oil based on multiple feature quantities", Proceedings of the CSEE *


Also Published As

Publication number Publication date
CN110838303B (en) 2022-02-08

Similar Documents

Publication Publication Date Title
CN109839612B (en) Sound source direction estimation method and device based on time-frequency masking and deep neural network
Kim et al. Generation of Large-Scale Simulated Utterances in Virtual Rooms to Train Deep-Neural Networks for Far-Field Speech Recognition in Google Home.
CN107452389B Universal single-channel real-time noise reduction method
CN106782590A Microphone array beamforming method in a reverberant environment
CN110726972B Voice sound source positioning method using a microphone array in interference and high-reverberation environments
CN101667425A Blind source separation method for convolutively mixed speech signals
Raykar et al. Speaker localization using excitation source information in speech
Niwa et al. Post-filter design for speech enhancement in various noisy environments
CN110838303B (en) Voice sound source positioning method using microphone array
CN112904279A (en) Sound source positioning method based on convolutional neural network and sub-band SRP-PHAT space spectrum
CN113129918A (en) Voice dereverberation method combining beam forming and deep complex U-Net network
Pertilä et al. Microphone array post-filtering using supervised machine learning for speech enhancement.
CN114171041A Voice noise reduction method, apparatus, device and storage medium based on environment detection
CN110111802A (en) Adaptive dereverberation method based on Kalman filtering
CN112269158B (en) Method for positioning voice source by utilizing microphone array based on UNET structure
CN115424627A (en) Voice enhancement hybrid processing method based on convolution cycle network and WPE algorithm
CN111123202B (en) Indoor early reflected sound positioning method and system
Pirhosseinloo et al. A new feature set for masking-based monaural speech separation
Guo et al. Underwater target detection and localization with feature map and CNN-based classification
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
CN115426055A (en) Noise-containing underwater acoustic signal blind source separation method based on decoupling convolutional neural network
CN101645701B (en) Time delay estimation method based on filter bank and system thereof
CN112712818A (en) Voice enhancement method, device and equipment
Sarabia et al. Spatial LibriSpeech: An Augmented Dataset for Spatial Audio Learning
JP2005258215A (en) Signal processing method and signal processing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant