CN113470682B - Method, device and storage medium for estimating speaker azimuth by microphone array - Google Patents
Classifications
- G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208 — Noise filtering
- G10L21/0216 — Noise filtering characterised by the method used for estimating noise
- G10L21/0232 — Processing in the frequency domain
- G10L21/0272 — Voice signal separating
- G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
- G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166 — Microphone arrays; beamforming
Abstract
The application discloses a method, a device and a storage medium for estimating the speaker's azimuth with a microphone array. The method comprises the following steps: S1, performing noise spectrum estimation on the signals acquired by the microphone array to obtain a noise amplitude spectrum sequence; S2, calculating frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the signals acquired by the microphone array; S3, calculating a cross-correlation accumulated power sequence of the frequency-domain and spatial-domain azimuth estimation of the microphone array from the frequency-domain azimuth estimation factors; and S4, searching for the maximum of the cross-correlation accumulated power sequence and taking the angle corresponding to the maximum as the azimuth estimate of the speaker. The method and device can accurately estimate the speaker's azimuth in noisy environments.
Description
Technical Field
The application belongs to the field of speech processing, in particular to microphone array pickup, and relates to a method, a device and a storage medium for estimating the speaker's azimuth with a microphone array.
Background
With the continuous breakthroughs and development of artificial intelligence theory, great progress has been made in the speech field, and branches of speech signal processing such as echo cancellation, speech enhancement, speech recognition, voiceprint recognition, semantic analysis and semantic understanding are developing rapidly. These branches all face a common problem: how to pick up high-quality speech signals. At present, to improve pickup quality, a microphone array is used to pick up, with high quality, the voice signal from the speaker's specific direction, and estimating the speaker's azimuth has become a key link in high-quality microphone array pickup. The speaker's azimuth estimate, affected by both ambient reverberation and ambient noise, usually carries a large error.
At present, the most classical method for estimating the speaker's azimuth with a microphone array is the SRP-PHAT (Steered-Response Power-Phase Transform) method. SRP-PHAT copes well with reverberant environments, but as soon as even modest environmental noise appears it tends to estimate the direction of the noise instead, and it has no mechanism to suppress that noise. The microphone array direction-of-arrival estimation based on the SRP-PHAT method proposed in Chinese patent application 201410366922.4 will, in the presence of a noise source, estimate the position of the noise source rather than the speaker's position, which is a drawback in noisy environments.
Fig. 1 shows the array element position diagram of a ternary microphone array: three microphones spaced 120 degrees apart and uniformly distributed on a circle of radius 4 cm. As shown in fig. 2 and fig. 3, this microphone array is used for speaker localization with the SRP-PHAT method in two scenarios. Scene 1, shown in fig. 2: the speaker is 3 meters from the array and speaks in turn from the three directions of 45, 90 and 135 degrees, with no noise source active. Scene 2, shown in fig. 3: the speaker is 3 meters from the array and speaks in turn from the three directions of 45, 90 and 135 degrees, while a noise source at 200 degrees and 3 meters plays white noise at a signal-to-noise ratio of 5 dB. As shown in fig. 4, in scenario 1, without a noise source, the SRP-PHAT method can estimate the speaker's bearing. As shown in fig. 5, with a noise source present, the SRP-PHAT method cannot estimate the speaker's bearing and more often estimates the bearing of the noise source.
Comparing fig. 4 and fig. 5, we can see that the SRP-PHAT method effectively estimates the speaker's azimuth in a reverberant environment, but once a noise source appears it fails to estimate the speaker's azimuth and mostly estimates the azimuth of the noise source instead.
Summarizing: (1) Existing speaker azimuth estimation methods such as the SRP (Steered-Response Power) method can fail in scenes with reverberation. To improve the anti-reverberation performance of speaker azimuth estimation, many scholars adopted the classical SRP-PHAT (Steered-Response Power-Phase Transform) method for speaker localization in reverberant environments, and the SRP-PHAT method achieves a fairly ideal localization effect there. (2) Existing speaker azimuth estimation methods cannot accurately estimate the speaker's azimuth when a noise source is present; in particular, the classical SRP-PHAT method is found, under such conditions, to estimate the azimuth of the noise source rather than locate the speaker. This causes the SRP-PHAT method to fail at speaker localization in environments where noise sources are present.
In view of the SRP-PHAT method's inapplicability to noisy environments, this application provides a method and a device for estimating the speaker's azimuth with a microphone array, so that the array can accurately estimate the speaker's azimuth under both reverberation and noise, and thereby pick up the speaker's voice with high quality.
Disclosure of Invention
The application aims to provide a method, a device and a storage medium for estimating the speaker position by using a microphone array, which can accurately estimate the speaker position in a noise environment.
According to a first aspect of the present application, there is provided a method of estimating a speaker's position using a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
S3, calculating a cross-correlation accumulated power sequence of the frequency-domain and spatial-domain azimuth estimation of the microphone array according to the frequency-domain azimuth estimation factors;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
In some embodiments, the microphone array includes P array elements, the microphone array collects frame signals of frame length L, and an M-point Fast Fourier Transform (FFT) is performed on each frame of length L, where M is the number of FFT points;
one frame signal x_{p,t} collected by array element p is defined as follows:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
wherein p is the array element index, p = 1, 2, …, P; t is the frame index, t = 1, 2, …, T, with T the total number of frames; x_{p,t,l} is the time-domain signal sample of the p-th array element, t-th frame and l-th sampling instant;
after the M-point FFT, the frame signal yields the signal spectrum sequence X_{p,t} as follows:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
wherein X_{p,t,m} is the signal spectrum sample of the p-th array element, t-th frame and m-th frequency bin.
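The framing and FFT step above can be sketched in NumPy (the sizes P, T, L, M below are illustrative values, not fixed by the patent):

```python
import numpy as np

P, T, L, M = 3, 4, 256, 256   # array elements, frames, frame length, FFT points (illustrative)
rng = np.random.default_rng(0)

# x[p, t, l]: time-domain sample of the p-th element, t-th frame, l-th sampling instant
x = rng.standard_normal((P, T, L))

# X[p, t, m]: signal spectrum sequence X_{p,t} obtained by an M-point FFT of each frame
X = np.fft.fft(x, n=M, axis=-1)

print(X.shape)  # (3, 4, 256)
```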
In some embodiments, step S2 specifically includes:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal acquired by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
wherein P_{p,t,m} = X_{p,t,m} × X*_{p,t,m} is the signal power spectrum value of the p-th array element, t-th frame and m-th frequency bin, X*_{p,t,m} is the complex conjugate of the signal spectrum of the p-th array element, t-th frame and m-th frequency bin, m = 1, 2, …, M;
according to the noise amplitude spectrum sequence N_{p,t} of array element p, calculating the noise power spectrum Q_{p,t} of array element p:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
wherein Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}, with N_{p,t,m} the noise spectrum value of the p-th array element, t-th frame and m-th frequency bin;
calculating the frequency-domain azimuth estimation factor PT_{p,q,t,m} of the m-th frequency bin between array element p and array element q (the formula is not reproduced in the source text), wherein P_{q,t,m} is the signal power spectrum value of the q-th array element, t-th frame and m-th frequency bin, Q_{p,t,m} and Q_{q,t,m} are the noise power spectrum values of the p-th and q-th array elements at the t-th frame and m-th frequency bin, and δ is a weighting factor.
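Continuing in NumPy, the power spectra P_{p,t} and Q_{p,t} follow directly from the definitions above. The patent's exact expression for PT_{p,q,t,m} is not reproduced in this text, so the `pt_factor` below is merely a hypothetical PHAT-style weighting built from the quantities the claim names (P, Q and the factor δ); it illustrates the idea of down-weighting bins dominated by estimated noise, not the patented formula:

```python
import numpy as np

P_el, T, M = 3, 4, 256
rng = np.random.default_rng(1)
X = np.fft.fft(rng.standard_normal((P_el, T, M)), axis=-1)  # signal spectra X[p, t, m]
N = 0.1 * np.ones((P_el, T, M))                             # noise magnitude spectra (placeholder)

Pw = (X * np.conj(X)).real   # P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}: signal power spectrum
Q = N ** 2                   # Q_{p,t,m} = N_{p,t,m}^2: noise power spectrum
delta = 0.25                 # weighting factor δ (the description suggests 0.25)

def pt_factor(p, q):
    """Hypothetical PHAT-style azimuth estimation factor between elements p and q:
    1 / sqrt((P_p + δ·Q_p) · (P_q + δ·Q_q)) — NOT the patent's exact formula,
    which appears only as an image in the source."""
    return 1.0 / np.sqrt((Pw[p] + delta * Q[p]) * (Pw[q] + delta * Q[q]))

PT01 = pt_factor(0, 1)       # factor for the element pair (0, 1), shape (T, M)
```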
In some embodiments, step S3 specifically includes:
the cross-correlation spectrum G_{p,q,t} between array element p and array element q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
wherein G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, with X*_{q,t,m} the complex conjugate of the signal spectrum of the q-th array element, t-th frame and m-th frequency bin;
from the frequency-domain azimuth estimation factor PT_{p,q,t,m}, the cross-correlation power PTG_{p,q,t} of the frequency-domain azimuth estimation factors is calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
wherein PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the azimuth angle of the speaker is θ, the transmission delay from the speaker direction to array element p is defined as
τ_p = r·cos(θ − θ_p)/c
where r is the distance from array element p to the geometric center of the microphone array, θ_p is the azimuth angle of array element p, and c is the speed of sound;
from the transmission delay τ_p, the phase H_{p,m} of array element p at frequency bin m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where the frequency-domain sampling interval is f = fs/M, with fs the system sampling rate;
the cross-correlation phase H_{p,q,m} at frequency bin m between array element p and array element q is:
H_{p,q,m} = H_{p,m} × H*_{q,m} = exp(−j2πmf(τ_p − τ_q));
selecting a sequence of frequency bins M1 to M2 from the M bins, the cross-correlation accumulated power PTGF_{p,q,t}(θ) for frequency-domain azimuth estimation is calculated as:
PTGF_{p,q,t}(θ) = Σ_{m=M1}^{M2} PTG_{p,q,t,m} × H_{p,q,m};
the cross-correlation accumulated power PTGFS_t(θ) of the frequency-domain and spatial-domain azimuth estimation is then:
PTGFS_t(θ) = Σ_{p=1}^{P−1} Σ_{q=p+1}^{P} PTGF_{p,q,t}(θ)
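A sketch of the frequency-domain and spatial-domain accumulation for a single frame. The circular-array delay τ_p = r·cos(θ − θ_p)/c is reconstructed from the surrounding definitions, the geometry matches the ternary array of Fig. 1, and the bin range, sample rate and PHAT-style weighting are illustrative placeholders rather than the patent's exact factors:

```python
import numpy as np

P_el, M, fs, c, r = 3, 256, 16000, 343.0, 0.04       # 3 elements on a 4 cm circle, as in Fig. 1
theta_el = np.deg2rad([0.0, 120.0, 240.0])           # element azimuths, 120 degrees apart
rng = np.random.default_rng(2)
X = np.fft.fft(rng.standard_normal((P_el, M)), axis=-1)   # one frame of spectra (illustrative)

f = fs / M                                           # frequency-domain sampling interval f = fs/M
m = np.arange(8, 101)                                # frequency-bin range M1..M2 (illustrative)

def accumulated_power(theta):
    """PTGFS for one frame: sum over bins M1..M2 and all element pairs p < q."""
    tau = r * np.cos(theta - theta_el) / c           # delay τ_p from direction θ to each element
    total = 0.0
    for p in range(P_el - 1):
        for q in range(p + 1, P_el):
            G = X[p, m] * np.conj(X[q, m])           # cross-correlation spectrum G_{p,q,m}
            PT = 1.0 / np.maximum(np.abs(G), 1e-12)  # placeholder PHAT-style weighting
            H = np.exp(-2j * np.pi * m * f * (tau[p] - tau[q]))  # H_{p,q,m} = H_{p,m}·H*_{q,m}
            total += np.real(np.sum(PT * G * H))     # real part keeps the power real-valued
    return total

val = accumulated_power(np.deg2rad(90.0))
```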
In some embodiments, in step S3, the speaker azimuth is estimated using the signals of multiple time-domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the time-domain, frequency-domain and spatial-domain azimuth estimation is calculated as:
PTGFST(θ) = Σ_{t=1}^{T} PTGFS_t(θ)
In some embodiments, in step S4, within the 0-to-360-degree circle the speaker azimuth θ is varied sequentially at the angular interval Δθ and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is Nθ, the cross-correlation accumulated power sequence PTGFST is:
PTGFST = [PTGFST(Δθ) PTGFST(2Δθ) … PTGFST(NθΔθ)]^T
Searching the maximum value of the cross-correlation accumulated power sequence, the angle corresponding to the maximum is the azimuth estimate of the speaker:
θ̂ = argmax_θ PTGFST(θ), θ ∈ {Δθ, 2Δθ, …, NθΔθ}
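The maximum search of step S4 is a plain argmax over the angle grid Δθ, 2Δθ, …, NθΔθ; in the sketch below a toy curve stands in for PTGFST(θ):

```python
import numpy as np

delta_theta = 1.0                                    # angular interval Δθ in degrees (illustrative)
angles = np.arange(delta_theta, 360.0 + delta_theta, delta_theta)  # Δθ, 2Δθ, …, NθΔθ

# Toy accumulated-power curve peaking at 90 degrees, standing in for PTGFST(θ)
ptgfst = np.cos(np.deg2rad(angles - 90.0))

theta_hat = angles[np.argmax(ptgfst)]                # azimuth estimate = angle of the maximum
print(theta_hat)  # 90.0
```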
in some embodiments, the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
According to a second aspect of the present application, there is provided an apparatus for estimating the speaker's azimuth with a microphone array, comprising a microphone array, a memory, a processor and a computer program stored on the memory and executable on the processor, the processor carrying out the method described above when executing the program to process the speaker signals acquired by the microphone array.
In some embodiments, the processor comprises:
the array noise spectrum estimation module is used for receiving the signals acquired by the microphone array and calculating a noise amplitude spectrum sequence;
the array frequency domain azimuth estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
the cross-correlation accumulated power generation module is used for receiving the frequency domain azimuth estimation factors output by the array frequency domain azimuth estimation factor generation module and calculating a cross-correlation accumulated power sequence by combining a time domain, a frequency domain and a space domain;
the maximum value searching module is used for receiving the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generating module, searching the maximum value of the cross-correlation accumulated power sequence and recording the angle interval corresponding to the maximum value; a kind of electronic device with high-pressure air-conditioning system
And the speaker azimuth angle estimation module is used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimation value.
According to a third aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.
Compared with the prior art, the application has the following advantages:
according to the method for estimating the speaker azimuth by using the microphone array, the noise spectrum and the voice spectrum are separated, the weight of the noise spectrum on the azimuth estimation factor is restrained, the contribution weight of the noise spectrum on the azimuth estimation factor is reduced, and the method has an obvious anti-noise effect; the application has good anti-reverberation performance, and can effectively estimate the speaker azimuth under the reverberation environment; the algorithm of the method occupies less resources and is suitable for engineering realization of a low-computation-force platform.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a diagram of array element positions of a ternary microphone array.
Fig. 2 is a plan view of speaker and noise orientations.
Fig. 3 is a three-dimensional view of speaker and noise orientations.
Fig. 4 shows the speaker bearing curve estimated by the SRP-PHAT method without a noise source.
Fig. 5 shows the speaker bearing curve estimated by the SRP-PHAT method in the presence of a noise source.
Fig. 6 shows a flow chart of a method according to an embodiment of the application.
Fig. 7 shows a block diagram of an apparatus according to an embodiment of the application.
Fig. 8 shows a speaker bearing curve estimated by the method of an embodiment of the present application under scenario 1.
Fig. 9 shows a speaker bearing curve estimated by the method of an embodiment of the present application in scenario 2.
Detailed Description
Preferred embodiments of the present application will be described in detail below with reference to the attached drawings so that the advantages and features of the present application can be more easily understood by those skilled in the art. The description of these embodiments is provided to assist understanding of the present application, but is not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In accordance with one embodiment of the present application, a method for estimating the speaker's azimuth with a microphone array is shown in fig. 6, and the flow of the method is described in detail below with reference to fig. 6.
(1) Estimating the noise spectrum.
Assume the microphone array has P array elements; the array acquires frame signals of frame length L and performs an M-point FFT on each frame of length L. The frame signal acquired by array element p is defined as follows:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
wherein t denotes the t-th frame signal (t runs from 1 to T).
After the M-point FFT, the frame signal yields the frequency-domain sequence:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
Using the classical noise spectrum estimation method, the MCRA (Minima Controlled Recursive Averaging) method, the estimated M-point noise amplitude spectrum sequence is:
N_{p,t} = [N_{p,t,1} N_{p,t,2} … N_{p,t,M}]^T
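The MCRA method itself combines recursive averaging with minimum statistics over a search window and a speech-presence probability; the estimator below is a heavily simplified sketch of only the minima-controlled recursive-averaging idea (the smoothing constant and the lack of a minimum-reset window are simplifications, not MCRA as published):

```python
import numpy as np

def noise_magnitude_estimate(mag_frames, alpha=0.9):
    """Toy minima-controlled recursive average of a magnitude spectrogram.

    mag_frames: array (T, M) of per-frame magnitude spectra |X_{t,m}|.
    Returns (T, M) noise magnitude estimates N_{t,m}. Real MCRA additionally
    resets the tracked minimum over a sliding window and gates the update
    with a speech-presence probability."""
    T, M = mag_frames.shape
    smooth = mag_frames[0].copy()
    minimum = mag_frames[0].copy()
    out = np.empty((T, M))
    for t in range(T):
        smooth = alpha * smooth + (1 - alpha) * mag_frames[t]  # recursive averaging
        minimum = np.minimum(minimum, smooth)                  # track spectral minima
        out[t] = minimum
    return out

# A 0.1 noise floor with a short "speech" burst: the burst should barely move the minimum.
mags = 0.1 + np.zeros((50, 8))
mags[20:25] += 2.0
Nest = noise_magnitude_estimate(mags)
```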
(2) Calculating the array frequency-domain azimuth estimation factors.
From the signal spectrum sequence X_{p,t} of array element p, the power spectrum of the signal acquired by array element p can be calculated as:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
wherein P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}.
From the noise amplitude spectrum sequence N_{p,t} estimated for array element p, the noise power spectrum of array element p can be calculated as:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
wherein Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}.
According to the cross-correlation method, the frequency-domain azimuth estimation factor between array element p and array element q at the m-th frequency bin is calculated (the formula is not reproduced in the source text), where δ is a weighting factor, typically chosen as 0.25.
(3) Calculating the accumulated power over the array time domain and frequency domain.
The cross-correlation spectrum between array element p and array element q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
wherein G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}.
From the frequency-domain azimuth estimation factor PT_{p,q,t,m}, the cross-correlation power of the frequency-domain azimuth estimation factors can be calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
wherein PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m}.
Assume the azimuth angle of the speaker is θ and the microphone array is a circular array with radius r from each array element to the center. The transmission delay from the speaker direction to array element p is defined as
τ_p = r·cos(θ − θ_p)/c
where θ_p is the azimuth angle of array element p and c is the speed of sound.
From the transmission delay τ_p from the speaker direction to array element p, the phase of array element p at frequency bin m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where f = fs/M is the frequency-domain sampling interval, fs is the system sampling rate, and M is the number of FFT points.
If the microphone array has another shape, the phase of array element p at frequency bin m is calculated in the same way, with τ_p derived from that geometry.
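The delay and phase terms can be sketched directly; the circular-array delay formula τ_p = r·cos(θ − θ_p)/c is reconstructed from the surrounding definitions (r, θ_p, c), and the element azimuths below are illustrative:

```python
import numpy as np

r, c, fs, M = 0.04, 343.0, 16000, 256        # radius 4 cm, speed of sound, sample rate, FFT points
theta_el = np.deg2rad([0.0, 120.0, 240.0])   # azimuth angle θ_p of each array element (illustrative)

def element_phases(theta, m):
    """Phase H_{p,m} = exp(−j·2π·m·f·τ_p) of every element at frequency bin m
    for a source at azimuth theta, with f = fs/M."""
    tau = r * np.cos(theta - theta_el) / c   # reconstructed delay τ_p = r·cos(θ − θ_p)/c
    f = fs / M                               # frequency-domain sampling interval
    return np.exp(-2j * np.pi * m * f * tau)

H = element_phases(np.deg2rad(45.0), m=10)   # phases of the 3 elements at bin 10
```

Since each phase is a pure complex exponential, every entry of `H` has unit magnitude; only its argument carries the direction information.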
The cross-correlation phase between array element p and array element q at frequency bin m is:
H_{p,q,m} = H_{p,m} × H*_{q,m} = exp(−j2πmf(τ_p − τ_q))
Selecting a frequency-bin sequence from M1 to M2 out of the M bins, the cross-correlation accumulated power for frequency-domain azimuth estimation is obtained as:
PTGF_{p,q,t}(θ) = Σ_{m=M1}^{M2} PTG_{p,q,t,m} × H_{p,q,m}
The microphone array has P array elements, so the cross-correlation accumulated power of the frequency-domain and spatial-domain azimuth estimation is:
PTGFS_t(θ) = Σ_{p=1}^{P−1} Σ_{q=p+1}^{P} PTGF_{p,q,t}(θ)
If the speaker azimuth is estimated using the signals of multiple time-domain data frames, the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimation needs to be calculated as:
PTGFST(θ) = Σ_{t=1}^{T} PTGFS_t(θ)
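Accumulating over T frames is a plain sum of the per-frame spatial-spectral powers; in the sketch below the per-frame curves are toy stand-ins for PTGFS_t(θ):

```python
import numpy as np

T = 5
angles = np.arange(0.0, 360.0, 5.0)          # search grid (illustrative)

# Toy per-frame powers PTGFS_t(θ): the same 90-degree peak scaled per frame
gains = np.linspace(0.5, 1.5, T)
per_frame = gains[:, None] * np.cos(np.deg2rad(angles - 90.0))[None, :]

ptgfst = per_frame.sum(axis=0)               # PTGFST(θ) = Σ_t PTGFS_t(θ)
best = angles[np.argmax(ptgfst)]
print(best)  # 90.0
```

Summing over frames averages out frame-to-frame fluctuations, which is why the multi-frame estimate is more stable than any single-frame one.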
(4) Searching for the maximum of the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimates.
Within the 0-to-360-degree circle, the speaker azimuth θ is varied sequentially at the angular interval Δθ and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is Nθ, the cross-correlation accumulated power sequence is as follows:
PTGFST = [PTGFST(Δθ) PTGFST(2Δθ) … PTGFST(NθΔθ)]^T
Searching the maximum value of the cross-correlation accumulated power sequence, the angle corresponding to the maximum is the azimuth estimate of the speaker:
θ̂ = argmax_θ PTGFST(θ), θ ∈ {Δθ, 2Δθ, …, NθΔθ}
according to another embodiment of the present application, an apparatus for estimating a speaker using a microphone array is provided. Referring to fig. 7, the apparatus comprises a microphone array 10, a memory, a processor 1 and a computer program stored on the memory and executable on the processor. The processor 1 implements the method as described above when executing the program to process the speaker signals of the microphone array acquisition 10. Specifically, the processor 1 includes:
an array noise spectrum estimation module 11, which receives the acquired signals sent from the microphone array 10, calculates the array noise spectrum using classical spectral subtraction, and sends the array noise spectrum data to the array frequency-domain azimuth estimation factor generation module 12;
an array frequency domain orientation estimation factor generation module 12, configured to receive the noise amplitude spectrum sequence, and calculate a frequency domain orientation estimation factor between each array element of the microphone array 10 according to the noise amplitude spectrum sequence and the frequency domain sequence of the signal acquired by the microphone array 10;
a cross-correlation accumulated power generating module 13 for receiving the frequency domain azimuth estimation factors sent from the array frequency domain azimuth estimation factor generating module 12, calculating a cross-correlation accumulated power sequence in combination with the time domain, the frequency domain and the space domain, and sending the power sequence to an accumulated power sequence maximum value searching module 14;
a maximum value searching module 14, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generating module, search for its maximum value, record the number of angular steps corresponding to the maximum value, and send it to the speaker azimuth estimating module 15;
and the speaker azimuth estimating module 15 is used for receiving the angle interval output by the maximum value searching module 14 and calculating a speaker azimuth estimated value.
The microphone array 10 collects speaker signals and sends them to the array noise spectrum estimation module 11.
According to yet another embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
In scenario 1 shown in fig. 2 and scenario 2 shown in fig. 3, respectively, the speaker orientation is estimated using the method of the above-described embodiment, and the resulting speaker orientation curves are shown in fig. 8 and fig. 9, respectively. As can be seen from fig. 8 and 9, in the case of the occurrence of a noise source, the method of the embodiment can more accurately estimate the speaker's azimuth.
Scene 1 shown in fig. 2: the speaker is 3 meters from the array and speaks in turn from the three directions of 45, 90 and 135 degrees, with no noise source active.
Scene 2 shown in fig. 3: the speaker is 3 meters from the array and speaks in turn from the three directions of 45, 90 and 135 degrees, while a noise source at 200 degrees and 3 meters plays white noise at a signal-to-noise ratio of 5 dB.
The above-described embodiments are provided for illustrating the technical concept and features of the present application, and are intended to be preferred embodiments for those skilled in the art to understand the present application and implement the same according to the present application, not to limit the scope of the present application. All equivalent changes or modifications made according to the spirit of the present application should be included in the scope of the present application.
Claims (10)
1. A method for estimating speaker orientation with a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
S3, calculating a cross-correlation accumulated power sequence of the frequency-domain and spatial-domain azimuth estimation of the microphone array according to the frequency-domain azimuth estimation factors;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
2. The method of claim 1, wherein the microphone array comprises P array elements, each frame of signal collected by the microphone array has length L, and an M-point fast Fourier transform is performed on each frame of length L, where M represents the number of points of the fast Fourier transform;
one frame of signal x_{p,t} collected by array element p is defined as follows:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
wherein p represents the array element index, p = 1, 2, …, P; t represents the frame index, t = 1, 2, …, T, where T is the total number of frames; x_{p,t,l} represents the time-domain sample of the p-th array element, t-th frame, at the l-th sampling instant;
performing the M-point fast Fourier transform on this frame yields the signal spectrum sequence X_{p,t} as follows:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
wherein X_{p,t,m} represents the signal spectrum sample of the p-th array element, t-th frame, at the m-th frequency sampling point.
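The framing and M-point FFT above can be sketched as follows; the array size, frame length, and random test signal are illustrative:

```python
import numpy as np

# Sketch of claim 2: frame a P-element array signal and take an M-point FFT.
P = 4      # number of array elements
L = 512    # frame length in samples
M = 512    # number of FFT points

rng = np.random.default_rng(0)
x = rng.standard_normal((P, L))   # x[p] is one frame x_{p,t} from element p

# X[p, m] is the spectrum sample X_{p,t,m} of element p at frequency bin m
X = np.fft.fft(x, n=M, axis=1)
print(X.shape)  # prints (4, 512)
```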
3. The method according to claim 2, wherein step S2 specifically comprises:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal acquired by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
wherein P_{p,t,m} = X_{p,t,m} × X*_{p,t,m} represents the signal power spectrum value of the p-th array element, t-th frame, at the m-th frequency sampling point, X*_{p,t,m} represents the complex conjugate of the signal spectrum value of the p-th array element, t-th frame, at the m-th frequency sampling point, and m = 1, 2, …, M;
according to the noise amplitude spectrum sequence N_{p,t} of array element p, calculating the noise signal power spectrum Q_{p,t} of array element p as follows:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
wherein Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}, Q_{p,t,m} represents the noise power spectrum value of the p-th array element, t-th frame, at the m-th frequency sampling point, and N_{p,t,m} represents the noise amplitude spectrum value of the p-th array element, t-th frame, at the m-th frequency sampling point;
calculating the frequency domain azimuth estimation factor PT_{p,q,t,m} at the m-th frequency point between array element p and array element q according to the following formula:
wherein P_{q,t,m} represents the signal power spectrum value of the q-th array element, t-th frame, at the m-th frequency sampling point; Q_{p,t,m} and Q_{q,t,m} represent the noise power spectrum values of the p-th and q-th array elements, t-th frame, at the m-th frequency sampling point; and δ is a weighting factor.
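The two power spectra that feed the azimuth estimation factor can be sketched as below. The claim's closed-form expression for PT_{p,q,t,m} (combining them with the weighting factor δ) is not reproduced here; only its two inputs are shown, with synthetic data:

```python
import numpy as np

# Sketch of claim 3's inputs, per element p and frequency bin m:
#   P_{p,t,m} = X_{p,t,m} * conj(X_{p,t,m})   (signal power spectrum)
#   Q_{p,t,m} = N_{p,t,m} * N_{p,t,m}         (noise power spectrum)
rng = np.random.default_rng(1)
P_elems, M = 4, 256
X = np.fft.fft(rng.standard_normal((P_elems, M)), axis=1)   # signal spectra
# Noise amplitude spectra (synthetic stand-in for the S1 noise estimate)
N = 0.1 * np.abs(np.fft.fft(rng.standard_normal((P_elems, M)), axis=1))

P_sig = (X * np.conj(X)).real   # elementwise |X|^2, real and non-negative
Q = N * N                       # squared noise amplitude
print(np.allclose(P_sig, np.abs(X) ** 2))  # prints True
```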
4. A method according to claim 2 or 3, wherein step S3 comprises:
assuming the cross-correlation spectrum G_{p,q,t} between array element p and array element q is as follows:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
wherein G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, and X*_{q,t,m} represents the complex conjugate of the signal spectrum value of the q-th array element, t-th frame, at the m-th frequency sampling point;
from the frequency domain azimuth estimation factor PT_{p,q,t,m}, calculating the cross-correlation power PTG_{p,q,t} of the frequency domain azimuth estimation factors as follows:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
wherein PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the azimuth angle of the speaker is θ, the transmission delay τ_p from the speaker position to array element p is defined by the following formula:
wherein r is the distance from array element p to the geometric center of the microphone array, θ_p is the azimuth angle of array element p, and c is the speed of sound;
from the transmission delay τ_p from the speaker position to array element p, the phase H_{p,m} of array element p at frequency point m is defined as follows:
H_{p,m} = exp(−j2πmfτ_p)
wherein f is the frequency-domain sample spacing, f = fs/M, and fs is the system sampling rate;
the cross-correlation phase H_{p,q,m} at frequency point m between array element p and array element q is as follows:
assuming that a sequence of frequency points from M1 to M2 is selected from the M frequency points for the frequency domain azimuth estimation, the cross-correlation accumulated power PTGF_{p,q,t} is calculated as follows:
the cross-correlation accumulated power PTGFS_t of the frequency-domain and spatial-domain azimuth estimation is as follows:
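The steering step can be sketched for a circular array. The far-field delay formula τ_p = r·cos(θ − θ_p)/c used below is a standard textbook assumption for illustration only; the claim's own delay expression is not reproduced here:

```python
import numpy as np

# Sketch of claim 4's steering phases for a candidate azimuth theta:
#   H_{p,m} = exp(-j * 2*pi * m * f * tau_p), with f = fs/M.
P, M, fs, c, r = 4, 256, 16000, 343.0, 0.05   # illustrative array geometry
theta_p = 2 * np.pi * np.arange(P) / P        # element azimuths on the circle
theta = np.deg2rad(90.0)                      # candidate speaker azimuth

# Assumed far-field delay to each element, relative to the array center
tau = r * np.cos(theta - theta_p) / c         # seconds, one value per element
f = fs / M                                    # frequency-domain sample spacing
m = np.arange(M)
H = np.exp(-1j * 2 * np.pi * np.outer(tau, m) * f)   # H[p, m]

# Cross-correlation phase between elements p=0 and q=1 across all bins:
Hpq = H[0] * np.conj(H[1])
print(np.allclose(np.abs(Hpq), 1.0))  # prints True: pure phase terms
```

Because the steering terms have unit magnitude, multiplying them against the cross-correlation powers before accumulation only rotates phases, which is what makes the accumulated power peak when the candidate azimuth matches the true one.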
5. The method according to claim 4, wherein in step S3, the speaker azimuth is estimated using the signals of a plurality of data frames in the time domain, and the cross-correlation accumulated power PTGFST(θ) of the time-domain, frequency-domain and spatial-domain azimuth estimation is calculated as follows:
6. The method according to claim 5, wherein in step S4, within the full circle of 0 to 360 degrees, the speaker azimuth θ is varied sequentially at the angular interval Δ_θ, and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is N_θ, the cross-correlation accumulated power sequence PTGFST is as follows:
PTGFST = [PTGFST(Δ_θ) PTGFST(2Δ_θ) … PTGFST(N_θΔ_θ)]^T
searching for the maximum value of the cross-correlation accumulated power sequence, the angle corresponding to the maximum value is taken as the azimuth estimate of the speaker, as follows:
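The grid sweep and maximum search of claim 6 can be sketched as follows; `power_at` is a hypothetical stand-in for the claim's PTGFST(θ):

```python
import numpy as np

# Sketch of claim 6: evaluate the accumulated power on an angular grid of
# spacing delta_theta over (0, 360] degrees and take the argmax.
def estimate_azimuth(power_at, delta_theta=1.0):
    angles = np.arange(delta_theta, 360.0 + delta_theta, delta_theta)
    powers = np.array([power_at(a) for a in angles])
    return angles[np.argmax(powers)]   # angle of the maximum accumulated power

# Toy accumulated-power function with a single peak at 135 degrees:
peak = 135.0
power = lambda a: np.cos(np.deg2rad(a - peak))
print(estimate_azimuth(power))  # prints 135.0
```

A finer Δ_θ trades search cost for angular resolution; the claim leaves the interval as a free parameter.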
7. the method of claim 5, wherein the microphone array is a circular array and r is the distance from the array element p to the center of the microphone array.
8. An apparatus for estimating the speaker azimuth using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the method of any one of claims 1 to 7 to process the speaker signals acquired by the microphone array.
9. The apparatus of claim 8, wherein the processor comprises:
the array noise spectrum estimation module is used for receiving the signals acquired by the microphone array and calculating a noise amplitude spectrum sequence;
the array frequency domain azimuth estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
the cross-correlation accumulated power generation module is used for receiving the frequency domain azimuth estimation factors output by the array frequency domain azimuth estimation factor generation module and calculating the cross-correlation accumulated power sequence over the time, frequency and spatial domains;
the maximum value searching module is used for receiving the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generating module, searching the maximum value of the cross-correlation accumulated power sequence and recording the angle interval corresponding to the maximum value;
and the speaker azimuth angle estimation module is used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimation value.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110664316.0A CN113470682B (en) | 2021-06-16 | 2021-06-16 | Method, device and storage medium for estimating speaker azimuth by microphone array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113470682A CN113470682A (en) | 2021-10-01 |
CN113470682B true CN113470682B (en) | 2023-11-24 |
Family
ID=77869967
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006194700A (en) * | 2005-01-12 | 2006-07-27 | Hiroshima Industrial Promotion Organization | Sound source direction estimation system, sound source direction estimation method and sound source direction estimation program |
JP2007006253A (en) * | 2005-06-24 | 2007-01-11 | Sony Corp | Signal processor, microphone system, and method and program for detecting speaker direction |
WO2008041878A2 (en) * | 2006-10-04 | 2008-04-10 | Micronas Nit | System and procedure of hands free speech communication using a microphone array |
CN104142492A (en) * | 2014-07-29 | 2014-11-12 | 佛山科学技术学院 | SRP-PHAT multi-source spatial positioning method |
CN106199607A (en) * | 2016-06-29 | 2016-12-07 | 北京捷通华声科技股份有限公司 | The Sounnd source direction localization method of a kind of microphone array and device |
CN109188362A (en) * | 2018-09-03 | 2019-01-11 | 中国科学院声学研究所 | A kind of microphone array auditory localization signal processing method |
CN110488223A (en) * | 2019-07-05 | 2019-11-22 | 东北电力大学 | A kind of sound localization method |
CN112216295A (en) * | 2019-06-25 | 2021-01-12 | 大众问问(北京)信息科技有限公司 | Sound source positioning method, device and equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010091077A1 (en) * | 2009-02-03 | 2010-08-12 | University Of Ottawa | Method and system for a multi-microphone noise reduction |
Non-Patent Citations (2)
Title |
---|
Tracking of a moving object by following the sound source; Jongho Han; 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM); full text * |
Research on sound source localization algorithms for an embedded microphone array system; Zhang Kun; China Master's Theses Full-text Database; full text * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |