CN113470682B - Method, device and storage medium for estimating speaker azimuth by microphone array


Info

Publication number
CN113470682B
Authority
CN
China
Prior art keywords
array
speaker
sequence
frequency domain
azimuth
Prior art date
Legal status
Active
Application number
CN202110664316.0A
Other languages
Chinese (zh)
Other versions
CN113470682A (en)
Inventor
马登永
蔡野锋
沐永生
叶超
Current Assignee
Zhongke Shangsheng Suzhou Electronics Co ltd
Original Assignee
Zhongke Shangsheng Suzhou Electronics Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Shangsheng Suzhou Electronics Co ltd
Priority to CN202110664316.0A
Publication of CN113470682A
Application granted
Publication of CN113470682B
Legal status: Active


Classifications

    • G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The application discloses a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array. The method comprises the following steps: S1, performing noise spectrum estimation on the signals acquired by the microphone array to obtain a noise amplitude spectrum sequence; S2, calculating frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the acquired signals; S3, calculating the cross-correlation accumulated power sequence of the frequency-domain and spatial-domain azimuth estimate of the microphone array from the frequency-domain azimuth estimation factors; and S4, searching for the maximum of the cross-correlation accumulated power sequence and taking the angle corresponding to that maximum as the speaker azimuth estimate. The method can accurately estimate the speaker's azimuth in a noisy environment.

Description

Method, device and storage medium for estimating speaker azimuth by microphone array
Technical Field
The application belongs to the field of speech processing, in particular to microphone array sound pickup, and relates to a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array.
Background
With continuing breakthroughs in artificial intelligence theory, great progress has been made in the speech field. Branches of speech signal processing such as echo cancellation, speech enhancement, speech recognition, voiceprint recognition, semantic analysis and semantic understanding are developing rapidly, and all of them face a common problem: how to pick up high-quality speech signals. At present, microphone arrays are used to pick up, with high quality, the speech arriving from the speaker's specific direction, so estimating the speaker's azimuth has become a key link in high-quality speech pickup with a microphone array. The speaker azimuth estimate is affected by both ambient reverberation and ambient noise and usually has a large error.
At present, the most classical method for estimating a speaker's azimuth with a microphone array is SRP-PHAT (Steered Response Power with Phase Transform). SRP-PHAT suppresses reverberation well, but as soon as even modest environmental noise appears it tends to estimate the noise direction instead: SRP-PHAT cannot suppress environmental noise. The microphone array direction-of-arrival estimation based on the SRP-PHAT method proposed in Chinese patent application 201410366922.4, for example, will estimate the position of a noise source rather than the speaker whenever a noise source is present, which is a drawback in noisy environments.
Fig. 1 shows the array element positions of a ternary microphone array: three microphones spaced 120 degrees apart and uniformly distributed on a circle with a radius of 4 cm. As shown in fig. 2 and fig. 3, this microphone array is used for speaker localization with the SRP-PHAT method in two scenarios. Scenario 1, shown in fig. 2: the speaker is 3 meters from the array and speaks in turn from the directions of 45, 90 and 135 degrees, with no noise source switched on. Scenario 2, shown in fig. 3: the speaker is 3 meters from the array and speaks in turn from the directions of 45, 90 and 135 degrees, while a noise source at 200 degrees and 3 meters plays white noise at a signal-to-noise ratio of 5 dB. As shown in fig. 4, in scenario 1 (no noise source) the SRP-PHAT method can estimate the speaker's azimuth. As shown in fig. 5, with a noise source present the SRP-PHAT method cannot estimate the speaker's azimuth and, most of the time, estimates the azimuth of the noise source.
Comparing fig. 4 and fig. 5, the SRP-PHAT method can effectively estimate the speaker's azimuth in a reverberant environment, but once a noise source appears it fails to estimate the speaker's azimuth and mostly estimates the azimuth of the noise source instead.
Summarizing: (1) Existing speaker azimuth estimation methods such as the SRP (Steered Response Power) method can fail in scenes with reverberation. To improve robustness to reverberation, many researchers adopted the classical SRP-PHAT (Steered Response Power with Phase Transform) method for speaker localization in reverberant environments, where SRP-PHAT achieves a fairly good localization effect. (2) Existing speaker azimuth estimation methods cannot estimate the speaker's azimuth accurately when a noise source is present; in particular, the classical SRP-PHAT method is found to estimate the azimuth of the noise source rather than locating the speaker under such conditions. SRP-PHAT therefore fails at speaker localization in environments where noise sources are present.
In view of the fact that the SRP-PHAT method cannot be applied in noisy environments, this application provides a method and a device for estimating a speaker's azimuth with a microphone array, so that the array can accurately estimate the speaker's azimuth under both reverberation and noise and thus pick up the speaker's voice with high quality.
Disclosure of Invention
The application aims to provide a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array that can accurately estimate the speaker's azimuth in a noisy environment.
According to a first aspect of the present application, there is provided a method of estimating a speaker's position using a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
s3, calculating a cross-correlation accumulated power sequence of frequency domain and airspace azimuth estimation of the microphone array according to the frequency domain azimuth estimation factor;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
In some embodiments, the microphone array includes P array elements, the array acquires signals in frames of length L, and an M-point Fast Fourier Transform (FFT) is performed on each frame of length L, where M is the number of FFT points.

A frame signal $x_{p,t}$ collected by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where p is the array element index, p = 1, 2, …, P; t is the frame index, t = 1, 2, …, T, with T the total number of frames; and $x_{p,t,l}$ is the signal time-domain sample value of the p-th array element, t-th frame, at the l-th sampling instant.

After the M-point FFT, the frame signal yields the signal spectrum sequence $X_{p,t}$:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

where $X_{p,t,m}$ is the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
In some embodiments, step S2 specifically includes:

From the signal spectrum sequence $X_{p,t}$, compute the power spectrum $P_{p,t}$ of the signal acquired by array element p:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$ is the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, $X^{*}_{p,t,m}$ is the signal spectrum conjugate value of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M.

From the noise amplitude spectrum sequence $N_{p,t}$ of array element p, compute the noise signal power spectrum $Q_{p,t}$ of array element p:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$ is the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and $N_{p,t,m}$ is the noise spectrum value of the p-th array element, t-th frame and m-th frequency sampling point.

The frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ at the m-th frequency point between array element p and array element q is then computed from the signal power spectrum values $P_{p,t,m}$ and $P_{q,t,m}$, the noise power spectrum values $Q_{p,t,m}$ and $Q_{q,t,m}$, and a weighting factor δ.
In some embodiments, step S3 specifically includes:

Let the cross-correlation spectrum $G_{p,q,t}$ between array element p and array element q be:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$ and $X^{*}_{q,t,m}$ is the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point.

From the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, compute the cross-correlation power $PTG_{p,q,t}$ of the frequency-domain azimuth estimation factors:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$.

Assuming the azimuth angle of the speaker is θ, define the transmission delay from the speaker position to array element p as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where r is the distance from array element p to the geometric center of the microphone array, $\theta_p$ is the azimuth angle of array element p, and c is the speed of sound.

From the transmission delay $\tau_p$ from the speaker position to array element p, define the phase $H_{p,m}$ of array element p at frequency point m as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval and $f_s$ is the system sampling rate.

The cross-correlation phase $H_{p,q,m}$ between array element p and array element q at frequency point m is:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$

Assuming a sequence of frequency points from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power $PTGF_{p,q,t}$ for frequency-domain azimuth estimation is computed as:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

The cross-correlation accumulated power $PTGFS_t$ of the frequency-domain and spatial-domain azimuth estimate is then:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$
in some embodiments, in the step S3, the speaker bearing is estimated using the signals of the time domain multiple data frames, and the cross-correlation accumulated power PTGFST (θ) of the time domain, frequency domain and spatial domain bearing estimates is calculated as follows:
in some embodiments, in step S4, the angular interval delta is set within a circumference of 0 to 360 degrees θ Sequentially changing the speaker azimuth theta to calculate the cross-correlation accumulated power PTGFST (theta) corresponding to each azimuth, and assuming that the total point number of the angle search is N θ The cross-correlation accumulated power PTGFST sequence is as follows:
PTGFST=[PTGFST(Δ θ )PTGFST(2Δ θ )…PTGFST(N θ Δ θ )] T
searching the maximum value of the cross-correlation accumulated power sequence, and finding out the angle corresponding to the maximum value, namely the azimuth estimation value of the speaker, as follows:
in some embodiments, the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
According to a second aspect of the present application, there is provided an apparatus for estimating a speaker's azimuth using a microphone array, comprising a microphone array, a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program to process speaker signals acquired by the microphone array, performs the method as described above.
In some embodiments, the processor comprises:

an array noise spectrum estimation module, configured to receive the signals acquired by the microphone array and calculate the noise amplitude spectrum sequence;

an array frequency-domain azimuth estimation factor generation module, configured to receive the noise amplitude spectrum sequence and calculate the frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the signals acquired by the microphone array;

a cross-correlation accumulated power generation module, configured to receive the frequency-domain azimuth estimation factors output by the array frequency-domain azimuth estimation factor generation module and calculate the cross-correlation accumulated power sequence over the time, frequency and spatial domains;

a maximum value search module, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module, search for its maximum value, and record the angle interval corresponding to the maximum value; and

a speaker azimuth angle estimation module, configured to receive the angle interval output by the maximum value search module and calculate the speaker azimuth angle estimate.
According to a third aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.
Compared with the prior art, the application has the following advantages:
according to the method for estimating the speaker azimuth by using the microphone array, the noise spectrum and the voice spectrum are separated, the weight of the noise spectrum on the azimuth estimation factor is restrained, the contribution weight of the noise spectrum on the azimuth estimation factor is reduced, and the method has an obvious anti-noise effect; the application has good anti-reverberation performance, and can effectively estimate the speaker azimuth under the reverberation environment; the algorithm of the method occupies less resources and is suitable for engineering realization of a low-computation-force platform.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a diagram of array element positions of a ternary microphone array.
Fig. 2 is a plan view of speaker and noise orientations.
Fig. 3 is a three-dimensional view of speaker and noise orientations.
Fig. 4 shows the speaker azimuth curve estimated by the SRP-PHAT method without a noise source.
Fig. 5 shows the speaker azimuth curve estimated by the SRP-PHAT method in the presence of a noise source.
Fig. 6 shows a flow chart of a method according to an embodiment of the application.
Fig. 7 shows a block diagram of an apparatus according to an embodiment of the application.
Fig. 8 shows the speaker azimuth curve estimated by the method of an embodiment of the present application in scenario 1.
Fig. 9 shows the speaker azimuth curve estimated by the method of an embodiment of the present application in scenario 2.
Detailed Description
Preferred embodiments of the present application will be described in detail below with reference to the attached drawings so that the advantages and features of the present application can be more easily understood by those skilled in the art. The description of these embodiments is provided to assist understanding of the present application, but is not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In accordance with one embodiment of the present application, a method for estimating a speaker's azimuth with a microphone array is shown in fig. 6; the flow of the method is described in detail below with reference to fig. 6.
(1) Noise spectrum estimation.

Assume the microphone array has P array elements; the array acquires frames of length L and performs an M-point FFT on each frame of length L. A frame signal acquired by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where t denotes the t-th frame, t = 1, 2, …, T.

After the M-point FFT, the frame signal yields the frequency-domain sequence:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

Using the classical noise spectrum estimation method, the MCRA (Minima Controlled Recursive Averaging) method, the estimated M-point noise amplitude spectrum sequence is:

$N_{p,t} = [N_{p,t,1}\ N_{p,t,2}\ \dots\ N_{p,t,M}]^T$
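As an illustration of step (1), the sketch below frames a multichannel capture, applies the M-point FFT, and estimates a noise amplitude spectrum sequence $N_{p,t}$. The minimum-tracking recursion (and the `alpha`, `beta` smoothing constants and function names) is a simplified stand-in for the full MCRA method, not the MCRA algorithm itself:

```python
import numpy as np

def frame_fft(x, L, M):
    """Split x of shape (P, num_samples) into frames of length L and
    apply an M-point FFT; returns X of shape (P, T, M)."""
    P, n = x.shape
    T = n // L
    frames = x[:, :T * L].reshape(P, T, L)
    return np.fft.fft(frames, n=M, axis=-1)

def noise_magnitude_spectrum(X, alpha=0.95, beta=0.998):
    """Simplified minima-controlled recursive averaging: smooth the power
    per bin, track its minimum as the noise floor, and return the noise
    amplitude spectrum N of shape (P, T, M)."""
    P, T, M = X.shape
    power = np.abs(X) ** 2
    smoothed = power[:, 0, :].copy()
    floor = smoothed.copy()
    N = np.empty((P, T, M))
    for t in range(T):
        smoothed = alpha * smoothed + (1 - alpha) * power[:, t, :]
        # let the tracked floor creep up slowly, then clamp it to the
        # current smoothed power so it follows the noise minima
        floor = np.minimum(floor / beta, smoothed)
        N[:, t, :] = np.sqrt(floor)
    return N
```

For example, `N = noise_magnitude_spectrum(frame_fft(x, L=512, M=512))` yields the sequence $N_{p,t}$ consumed in step (2); the frame and FFT lengths are illustrative.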
(2) Computing the array frequency-domain azimuth estimation factors.

From the signal spectrum sequence $X_{p,t}$ of array element p, the power spectrum of the signal acquired by array element p can be calculated as:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$.

From the noise amplitude spectrum sequence $N_{p,t}$ estimated for array element p, the noise signal power spectrum of array element p can be calculated as:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$.

According to the cross-correlation method, the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ between array element p and array element q at the m-th frequency point is computed from $P_{p,t,m}$, $P_{q,t,m}$, $Q_{p,t,m}$, $Q_{q,t,m}$ and a weighting factor δ, where δ is typically chosen as 0.25.
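A minimal sketch of step (2), assuming a particular noise-suppressed PHAT-style form for $PT_{p,q,t,m}$: spectral-subtraction gains built from Q multiply the usual $1/\sqrt{P_p P_q}$ normalization. This form, the function name and the `eps` floor are illustrative assumptions consistent with the stated goal of reducing the noise spectrum's contribution, not the patent's exact formula:

```python
import numpy as np

def azimuth_estimation_factor(X, N, delta=0.25, eps=1e-12):
    """ASSUMED form of PT[p, q, t, m]: spectral-subtraction gains for both
    elements on top of a PHAT normalization, so noise-dominated bins are
    down-weighted. X: spectra (P, T, M); N: noise amplitudes (P, T, M)."""
    P_sig = (X * np.conj(X)).real            # P[p, t, m] = X * conj(X)
    Q = N ** 2                               # Q[p, t, m] = N * N
    gain = np.maximum(P_sig - delta * Q, 0.0) / (P_sig + eps)
    Pn = X.shape[0]
    PT = np.empty((Pn, Pn) + X.shape[1:])
    for p in range(Pn):
        for q in range(Pn):
            # PHAT weight 1/sqrt(P_p * P_q), discounted by the noise gains
            PT[p, q] = gain[p] * gain[q] / (np.sqrt(P_sig[p] * P_sig[q]) + eps)
    return PT
```

With δ = 0 the gains equal 1 and the weight reduces to the plain SRP-PHAT normalization; larger δ discounts bins where the estimated noise power approaches the signal power.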
(3) Computing the accumulated power over the array time domain and frequency domain.

The cross-correlation spectrum between array element p and array element q is:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$.

From the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, the cross-correlation power of the frequency-domain azimuth estimation factors can be calculated as:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$.

Assuming the speaker azimuth angle is θ, and assuming the microphone array is circular with radius r from each array element to the center, the transmission delay from the speaker position to array element p is defined as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where $\theta_p$ is the azimuth of array element p and c is the speed of sound.

From the transmission delay $\tau_p$ from the speaker position to array element p, the phase of array element p at frequency point m is defined as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval, $f_s$ is the system sampling rate, and M is the number of FFT points.

If the microphone array has another shape, the phase of array element p at frequency point m is computed in the same way.

The cross-correlation phase between array element p and array element q at frequency point m is:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$
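A minimal sketch of the steering computation above for a circular array, using the delay convention $\tau_p = r\cos(\theta - \theta_p)/c$; the function name and argument layout are illustrative:

```python
import numpy as np

def pair_phases(theta, theta_elems, r, c, fs, M, m1, m2):
    """Cross-correlation phases H[p, q, m] for a candidate azimuth theta
    (radians), frequency bins m1..m2, and a circular array of radius r
    whose element azimuths are theta_elems (shape (P,))."""
    tau = r * np.cos(theta - theta_elems) / c            # tau_p per element
    m = np.arange(m1, m2 + 1)                            # selected bins
    f = fs / M                                           # bin spacing fs / M
    H = np.exp(-1j * 2 * np.pi * np.outer(tau, m) * f)   # H[p, m]
    return H[:, None, :] * np.conj(H[None, :, :])        # H_p * conj(H_q)
```

For the ternary array of fig. 1, `r = 0.04` and `theta_elems = np.deg2rad([90.0, 210.0, 330.0])` would be one consistent choice (the absolute orientation of the elements is an assumption).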
Assuming a sequence of frequency points from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power for frequency-domain azimuth estimation can be obtained as:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

The microphone array has P array elements, so the cross-correlation accumulated power of the frequency-domain and spatial-domain azimuth estimate is:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$

If the speaker azimuth is estimated using the signals of multiple time-domain data frames, the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimate is computed as:

$PTGFST(\theta) = \sum_{t=1}^{T} PTGFS_t(\theta)$
(4) Searching for the maximum of the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimate.

Over the 0-to-360-degree circle, the speaker azimuth θ is varied in steps of the angular interval $\Delta_\theta$ and the cross-correlation accumulated power PTGFST(θ) is computed for each azimuth. Assuming the total number of angle search points is $N_\theta$, the cross-correlation accumulated power sequence is:

$PTGFST = [PTGFST(\Delta_\theta)\ PTGFST(2\Delta_\theta)\ \dots\ PTGFST(N_\theta \Delta_\theta)]^T$

Search for the maximum of the cross-correlation accumulated power sequence; the angle corresponding to the maximum is the speaker azimuth estimate:

$\hat{\theta} = \hat{n}\,\Delta_\theta, \qquad \hat{n} = \arg\max_{n} PTGFST(n\Delta_\theta)$
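Combining steps (3) and (4), the sketch below accumulates PTGFST(θ) over the selected frequency bins, all array element pairs with q > p, and all frames for each candidate angle, then returns the angle at the maximum. It reuses the `pair_phases` helper sketched above; taking the real part of the accumulated sum as the search score is an implementation assumption:

```python
import numpy as np

def estimate_azimuth(X, PT, theta_elems, r, c, fs, m1, m2, step_deg=1.0):
    """Grid search over azimuth. X: spectra (P, T, M); PT: azimuth
    estimation factors (P, P, T, M). Returns the estimated azimuth in
    degrees and the PTGFST sequence over the search grid."""
    Pn, T, M = X.shape
    # PTG[p, q, t, m] = PT * G, with G = X_p * conj(X_q), kept for bins m1..m2
    PTG = (PT * (X[:, None, :, :] * np.conj(X[None, :, :, :])))[..., m1:m2 + 1]
    iu, ju = np.triu_indices(Pn, k=1)                  # pairs with q > p
    angles = np.arange(step_deg, 360.0 + 1e-9, step_deg)
    ptgfst = np.empty(len(angles))
    for n, a in enumerate(angles):
        H = pair_phases(np.deg2rad(a), theta_elems, r, c, fs, M, m1, m2)
        acc = (PTG * H[:, :, None, :]).real            # per-bin terms
        ptgfst[n] = acc[iu, ju].sum()                  # over pairs, frames, bins
    return angles[np.argmax(ptgfst)], ptgfst
```

On the low-compute platforms mentioned above, a coarse search (for example in 5-degree steps) followed by a fine search around the peak reduces the cost of the angle loop.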
according to another embodiment of the present application, an apparatus for estimating a speaker using a microphone array is provided. Referring to fig. 7, the apparatus comprises a microphone array 10, a memory, a processor 1 and a computer program stored on the memory and executable on the processor. The processor 1 implements the method as described above when executing the program to process the speaker signals of the microphone array acquisition 10. Specifically, the processor 1 includes:
an array noise spectrum estimation module 11 which receives the acquisition signals sent from the microphone array 10, calculates an array noise spectrum using classical spectral subtraction, and sends the array noise spectrum data to an array frequency domain azimuth estimation factor generation module 12;
an array frequency domain orientation estimation factor generation module 12, configured to receive the noise amplitude spectrum sequence, and calculate a frequency domain orientation estimation factor between each array element of the microphone array 10 according to the noise amplitude spectrum sequence and the frequency domain sequence of the signal acquired by the microphone array 10;
a cross-correlation accumulated power generating module 13 for receiving the frequency domain azimuth estimation factors sent from the array frequency domain azimuth estimation factor generating module 12, calculating a cross-correlation accumulated power sequence in combination with the time domain, the frequency domain and the space domain, and sending the power sequence to an accumulated power sequence maximum value searching module 14;
a maximum value searching module 14, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generating module, search for a maximum value thereof, record the number of angle interval changes corresponding to the maximum value, and send the number of angle interval changes to the speaker azimuth estimating module 15;
and the speaker azimuth estimating module 15 is used for receiving the angle interval output by the maximum value searching module 14 and calculating a speaker azimuth estimated value.
The microphone array 10 collects speaker signals and sends them to the array noise spectrum estimation module 11.
According to yet another embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
In scenario 1 shown in fig. 2 and scenario 2 shown in fig. 3, the speaker azimuth is estimated using the method of the above embodiment; the resulting speaker azimuth curves are shown in fig. 8 and fig. 9, respectively. As can be seen from fig. 8 and fig. 9, even when a noise source is present, the method of the embodiment can accurately estimate the speaker's azimuth.
The above-described embodiments are provided to illustrate the technical concept and features of the present application and to enable those skilled in the art to understand and implement it; they do not limit the scope of protection of the present application. All equivalent changes or modifications made according to the spirit of the present application shall fall within the scope of the present application.

Claims (10)

1. A method for estimating a speaker's azimuth with a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
s3, calculating a cross-correlation accumulated power sequence of frequency domain and airspace azimuth estimation of the microphone array according to the frequency domain azimuth estimation factor;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
2. The method of claim 1, wherein the microphone array comprises P array elements, the microphone array acquires signals in frames of length L, and an M-point fast Fourier transform is performed on each frame of length L, where M represents the number of points of the fast Fourier transform;

a frame signal $x_{p,t}$ collected by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where p represents the array element index, p = 1, 2, …, P; t represents the frame index, t = 1, 2, …, T, T being the total number of frames; and $x_{p,t,l}$ represents the signal time-domain sample value of the p-th array element, t-th frame and l-th sampling instant;

the frame signal is subjected to the M-point fast Fourier transform to obtain the signal spectrum sequence $X_{p,t}$:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

where $X_{p,t,m}$ represents the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
3. The method according to claim 2, wherein step S2 specifically comprises:

computing, from the signal spectrum sequence $X_{p,t}$, the power spectrum $P_{p,t}$ of the signal acquired by array element p:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$ represents the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, $X^{*}_{p,t,m}$ represents the signal spectrum conjugate value of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M;

computing, from the noise amplitude spectrum sequence $N_{p,t}$ of array element p, the noise signal power spectrum $Q_{p,t}$ of array element p:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$ represents the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and $N_{p,t,m}$ represents the noise spectrum value of the p-th array element, t-th frame and m-th frequency sampling point; and

computing the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ of the m-th frequency point between array element p and array element q from the signal power spectrum values $P_{p,t,m}$ and $P_{q,t,m}$, the noise power spectrum values $Q_{p,t,m}$ and $Q_{q,t,m}$, and a weighting factor δ.
4. A method according to claim 2 or 3, wherein step S3 comprises:

letting the cross-correlation spectrum $G_{p,q,t}$ between array element p and array element q be:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$, and $X^{*}_{q,t,m}$ represents the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point;

computing, from the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, the cross-correlation power $PTG_{p,q,t}$ of the frequency-domain azimuth estimation factors:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$;

assuming the azimuth angle of the speaker is θ, defining the transmission delay from the speaker position to array element p as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where r is the distance from array element p to the geometric center of the microphone array, $\theta_p$ is the azimuth angle of array element p, and c is the speed of sound;

defining, from the transmission delay $\tau_p$ from the speaker position to array element p, the phase $H_{p,m}$ of array element p at frequency point m as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval and $f_s$ is the system sampling rate;

the cross-correlation phase $H_{p,q,m}$ between array element p and array element q at frequency point m being:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$

computing, for a sequence of frequency points from M1 to M2 selected from the M frequency points, the cross-correlation accumulated power $PTGF_{p,q,t}$ of the frequency-domain azimuth estimate:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

the cross-correlation accumulated power $PTGFS_t$ of the frequency-domain and spatial-domain azimuth estimate being:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$
5. The method according to claim 4, wherein in step S3 the speaker azimuth is estimated using the signals of multiple time-domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the time-domain, frequency-domain and spatial-domain azimuth estimate is calculated as:

$PTGFST(\theta) = \sum_{t=1}^{T} PTGFS_t(\theta)$
6. The method according to claim 5, wherein in step S4 the speaker azimuth θ is varied within the 0-to-360-degree circle in steps of the angular interval $\Delta_\theta$, and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is $N_\theta$, the cross-correlation accumulated power sequence PTGFST is:

$PTGFST = [PTGFST(\Delta_\theta)\ PTGFST(2\Delta_\theta)\ \dots\ PTGFST(N_\theta \Delta_\theta)]^T$

the maximum value of the cross-correlation accumulated power sequence is searched, and the angle corresponding to the maximum value is the azimuth estimation value of the speaker:

$\hat{\theta} = \hat{n}\,\Delta_\theta, \qquad \hat{n} = \arg\max_{n} PTGFST(n\Delta_\theta)$
7. The method of claim 5, wherein the microphone array is a circular array and r is the distance from the array element p to the center of the microphone array.
8. An apparatus for estimating a speaker's azimuth using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program to process speaker signals acquired by the microphone array, performs the method of any of claims 1-7.
9. The apparatus of claim 8, wherein the processor comprises:

an array noise spectrum estimation module, configured to receive the signals acquired by the microphone array and calculate the noise amplitude spectrum sequence;

an array frequency-domain azimuth estimation factor generation module, configured to receive the noise amplitude spectrum sequence and calculate the frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the signals acquired by the microphone array;

a cross-correlation accumulated power generation module, configured to receive the frequency-domain azimuth estimation factors output by the array frequency-domain azimuth estimation factor generation module and calculate the cross-correlation accumulated power sequence over the time, frequency and spatial domains;

a maximum value search module, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module, search for its maximum value, and record the angle interval corresponding to the maximum value; and

a speaker azimuth angle estimation module, configured to receive the angle interval output by the maximum value search module and calculate the speaker azimuth angle estimate.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 7.
CN202110664316.0A 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array Active CN113470682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664316.0A CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array


Publications (2)

Publication Number Publication Date
CN113470682A CN113470682A (en) 2021-10-01
CN113470682B (en) 2023-11-24

Family

ID=77869967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110664316.0A Active CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array

Country Status (1)

Country Link
CN (1) CN113470682B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010091077A1 (en) * 2009-02-03 2010-08-12 University Of Ottawa Method and system for a multi-microphone noise reduction

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006194700A (en) * 2005-01-12 2006-07-27 Hiroshima Industrial Promotion Organization Sound source direction estimation system, sound source direction estimation method and sound source direction estimation program
JP2007006253A (en) * 2005-06-24 2007-01-11 Sony Corp Signal processor, microphone system, and method and program for detecting speaker direction
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
CN104142492A (en) * 2014-07-29 2014-11-12 佛山科学技术学院 SRP-PHAT multi-source spatial positioning method
CN106199607A (en) * 2016-06-29 2016-12-07 北京捷通华声科技股份有限公司 The Sounnd source direction localization method of a kind of microphone array and device
CN109188362A (en) * 2018-09-03 2019-01-11 中国科学院声学研究所 A kind of microphone array auditory localization signal processing method
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment
CN110488223A (en) * 2019-07-05 2019-11-22 东北电力大学 A kind of sound localization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jongho Han, "Tracking of a moving object by following the sound source," 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), full text. *
Zhang Kun, "Research on sound source localization algorithms for a microphone array based on an embedded system," China Master's Theses Full-text Database, full text. *

Also Published As

Publication number Publication date
CN113470682A (en) 2021-10-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant