CN113470682B - Method, device and storage medium for estimating speaker azimuth by microphone array


Info

Publication number
CN113470682B
Authority
CN
China
Prior art keywords
array
speaker
sequence
frequency domain
azimuth
Prior art date
Legal status
Active
Application number
CN202110664316.0A
Other languages
Chinese (zh)
Other versions
CN113470682A (en)
Inventor
马登永
蔡野锋
沐永生
叶超
Current Assignee
Zhongke Shangsheng Suzhou Electronics Co ltd
Original Assignee
Zhongke Shangsheng Suzhou Electronics Co ltd
Priority date
Filing date
Publication date
Application filed by Zhongke Shangsheng Suzhou Electronics Co ltd
Priority to CN202110664316.0A
Publication of CN113470682A
Application granted
Publication of CN113470682B
Legal status: Active


Classifications

    • G PHYSICS; G10 MUSICAL INSTRUMENTS; ACOUSTICS; G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING; G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility; G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L21/0272 Voice signal separating
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The application discloses a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array. The method comprises the following steps: S1, performing noise spectrum estimation on the signals acquired by the microphone array to obtain a noise amplitude spectrum sequence; S2, calculating frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the acquired signals; S3, calculating the cross-correlation accumulated power sequence of the frequency-domain and spatial-domain azimuth estimate of the microphone array from the frequency-domain azimuth estimation factors; and S4, searching for the maximum of the cross-correlation accumulated power sequence and taking the angle corresponding to that maximum as the speaker azimuth estimate. The method can accurately estimate the speaker's azimuth in a noisy environment.

Description

Method, device and storage medium for estimating speaker azimuth by microphone array
Technical Field
The application belongs to the field of speech processing, in particular to microphone array sound pickup, and relates to a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array.
Background
With continuing breakthroughs in artificial intelligence theory, great progress has been made in the speech field. Branches of speech signal processing such as echo cancellation, speech enhancement, speech recognition, voiceprint recognition, semantic analysis and semantic understanding are developing rapidly, and all of them face a common problem: how to pick up high-quality speech signals. At present, microphone arrays are used to pick up, with high quality, the speech arriving from the speaker's specific direction, so estimating the speaker's azimuth has become a key link in high-quality speech pickup with a microphone array. The speaker azimuth estimate is affected by both ambient reverberation and ambient noise and usually has a large error.
At present, the most classical method for estimating a speaker's azimuth with a microphone array is SRP-PHAT (Steered Response Power with Phase Transform). SRP-PHAT suppresses reverberation well, but as soon as even modest environmental noise appears it tends to estimate the noise direction instead: SRP-PHAT cannot suppress environmental noise. The microphone array direction-of-arrival estimation based on the SRP-PHAT method proposed in Chinese patent application 201410366922.4, for example, will estimate the position of a noise source rather than the speaker whenever a noise source is present, which is a drawback in noisy environments.
Fig. 1 shows the array element positions of a ternary microphone array: three microphones spaced 120 degrees apart and uniformly distributed on a circle with a radius of 4 cm. As shown in fig. 2 and fig. 3, this microphone array is used for speaker localization with the SRP-PHAT method in two scenarios. Scenario 1, shown in fig. 2: the speaker is 3 meters from the array and speaks in turn from the directions of 45, 90 and 135 degrees, with no noise source switched on. Scenario 2, shown in fig. 3: the speaker is 3 meters from the array and speaks in turn from the directions of 45, 90 and 135 degrees, while a noise source at 200 degrees and 3 meters plays white noise at a signal-to-noise ratio of 5 dB. As shown in fig. 4, in scenario 1 (no noise source) the SRP-PHAT method can estimate the speaker's azimuth. As shown in fig. 5, with a noise source present the SRP-PHAT method cannot estimate the speaker's azimuth and, most of the time, estimates the azimuth of the noise source.
Comparing fig. 4 and fig. 5, the SRP-PHAT method can effectively estimate the speaker's azimuth in a reverberant environment, but once a noise source appears it fails to estimate the speaker's azimuth and mostly estimates the azimuth of the noise source instead.
Summarizing: (1) Existing speaker azimuth estimation methods such as the SRP (Steered Response Power) method can fail in scenes with reverberation. To improve robustness to reverberation, many researchers adopted the classical SRP-PHAT (Steered Response Power with Phase Transform) method for speaker localization in reverberant environments, where SRP-PHAT achieves a fairly good localization effect. (2) Existing speaker azimuth estimation methods cannot estimate the speaker's azimuth accurately when a noise source is present; in particular, the classical SRP-PHAT method is found to estimate the azimuth of the noise source rather than locating the speaker under such conditions. SRP-PHAT therefore fails at speaker localization in environments where noise sources are present.
In view of the fact that the SRP-PHAT method cannot be applied in noisy environments, this application provides a method and a device for estimating a speaker's azimuth with a microphone array, so that the array can accurately estimate the speaker's azimuth under both reverberation and noise and thus pick up the speaker's voice with high quality.
Disclosure of Invention
The application aims to provide a method, a device and a storage medium for estimating a speaker's azimuth with a microphone array that can accurately estimate the speaker's azimuth in a noisy environment.
According to a first aspect of the present application, there is provided a method of estimating a speaker's position using a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
s3, calculating a cross-correlation accumulated power sequence of frequency domain and airspace azimuth estimation of the microphone array according to the frequency domain azimuth estimation factor;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
In some embodiments, the microphone array includes P array elements, the array acquires signals in frames of length L, and an M-point Fast Fourier Transform (FFT) is performed on each frame of length L, where M is the number of FFT points.

A frame signal $x_{p,t}$ collected by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where p is the array element index, p = 1, 2, …, P; t is the frame index, t = 1, 2, …, T, with T the total number of frames; and $x_{p,t,l}$ is the signal time-domain sample value of the p-th array element, t-th frame, at the l-th sampling instant.

After the M-point FFT, the frame signal yields the signal spectrum sequence $X_{p,t}$:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

where $X_{p,t,m}$ is the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
In some embodiments, step S2 specifically includes:

From the signal spectrum sequence $X_{p,t}$, compute the power spectrum $P_{p,t}$ of the signal acquired by array element p:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$ is the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, $X^{*}_{p,t,m}$ is the signal spectrum conjugate value of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M.

From the noise amplitude spectrum sequence $N_{p,t}$ of array element p, compute the noise signal power spectrum $Q_{p,t}$ of array element p:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$ is the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and $N_{p,t,m}$ is the noise spectrum value of the p-th array element, t-th frame and m-th frequency sampling point.

The frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ at the m-th frequency point between array element p and array element q is then computed from the signal power spectrum values $P_{p,t,m}$ and $P_{q,t,m}$, the noise power spectrum values $Q_{p,t,m}$ and $Q_{q,t,m}$, and a weighting factor δ.
In some embodiments, step S3 specifically includes:

Let the cross-correlation spectrum $G_{p,q,t}$ between array element p and array element q be:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$ and $X^{*}_{q,t,m}$ is the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point.

From the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, compute the cross-correlation power $PTG_{p,q,t}$ of the frequency-domain azimuth estimation factors:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$.

Assuming the azimuth angle of the speaker is θ, define the transmission delay from the speaker position to array element p as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where r is the distance from array element p to the geometric center of the microphone array, $\theta_p$ is the azimuth angle of array element p, and c is the speed of sound.

From the transmission delay $\tau_p$ from the speaker position to array element p, define the phase $H_{p,m}$ of array element p at frequency point m as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval and $f_s$ is the system sampling rate.

The cross-correlation phase $H_{p,q,m}$ between array element p and array element q at frequency point m is:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$

Assuming a sequence of frequency points from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power $PTGF_{p,q,t}$ for frequency-domain azimuth estimation is computed as:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

The cross-correlation accumulated power $PTGFS_t$ of the frequency-domain and spatial-domain azimuth estimate is then:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$
in some embodiments, in the step S3, the speaker bearing is estimated using the signals of the time domain multiple data frames, and the cross-correlation accumulated power PTGFST (θ) of the time domain, frequency domain and spatial domain bearing estimates is calculated as follows:
in some embodiments, in step S4, the angular interval delta is set within a circumference of 0 to 360 degrees θ Sequentially changing the speaker azimuth theta to calculate the cross-correlation accumulated power PTGFST (theta) corresponding to each azimuth, and assuming that the total point number of the angle search is N θ The cross-correlation accumulated power PTGFST sequence is as follows:
PTGFST=[PTGFST(Δ θ )PTGFST(2Δ θ )…PTGFST(N θ Δ θ )] T
searching the maximum value of the cross-correlation accumulated power sequence, and finding out the angle corresponding to the maximum value, namely the azimuth estimation value of the speaker, as follows:
in some embodiments, the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
According to a second aspect of the present application, there is provided an apparatus for estimating a speaker's azimuth using a microphone array, comprising a microphone array, a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program to process speaker signals acquired by the microphone array, performs the method as described above.
In some embodiments, the processor comprises:

an array noise spectrum estimation module, configured to receive the signals acquired by the microphone array and calculate the noise amplitude spectrum sequence;

an array frequency-domain azimuth estimation factor generation module, configured to receive the noise amplitude spectrum sequence and calculate the frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the signals acquired by the microphone array;

a cross-correlation accumulated power generation module, configured to receive the frequency-domain azimuth estimation factors output by the array frequency-domain azimuth estimation factor generation module and calculate the cross-correlation accumulated power sequence over the time, frequency and spatial domains;

a maximum value search module, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module, search for its maximum value, and record the angle interval corresponding to the maximum value; and

a speaker azimuth angle estimation module, configured to receive the angle interval output by the maximum value search module and calculate the speaker azimuth angle estimate.
According to a third aspect of the present application there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements a method as described above.
Compared with the prior art, the application has the following advantages:
according to the method for estimating the speaker azimuth by using the microphone array, the noise spectrum and the voice spectrum are separated, the weight of the noise spectrum on the azimuth estimation factor is restrained, the contribution weight of the noise spectrum on the azimuth estimation factor is reduced, and the method has an obvious anti-noise effect; the application has good anti-reverberation performance, and can effectively estimate the speaker azimuth under the reverberation environment; the algorithm of the method occupies less resources and is suitable for engineering realization of a low-computation-force platform.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a diagram of array element positions of a ternary microphone array.
Fig. 2 is a plan view of speaker and noise orientations.
Fig. 3 is a three-dimensional view of speaker and noise orientations.
Fig. 4 shows the speaker azimuth curve estimated by the SRP-PHAT method without a noise source.
Fig. 5 shows the speaker azimuth curve estimated by the SRP-PHAT method in the presence of a noise source.
Fig. 6 shows a flow chart of a method according to an embodiment of the application.
Fig. 7 shows a block diagram of an apparatus according to an embodiment of the application.
Fig. 8 shows the speaker azimuth curve estimated by the method of an embodiment of the present application in scenario 1.
Fig. 9 shows the speaker azimuth curve estimated by the method of an embodiment of the present application in scenario 2.
Detailed Description
Preferred embodiments of the present application will be described in detail below with reference to the attached drawings so that the advantages and features of the present application can be more easily understood by those skilled in the art. The description of these embodiments is provided to assist understanding of the present application, but is not intended to limit the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In accordance with one embodiment of the present application, a method for estimating a speaker's azimuth with a microphone array is shown in fig. 6; the flow of the method is described in detail below with reference to fig. 6.
(1) Noise spectrum estimation.

Assume the microphone array has P array elements; the array acquires frames of length L and performs an M-point FFT on each frame of length L. A frame signal acquired by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where t denotes the t-th frame, t = 1, 2, …, T.

After the M-point FFT, the frame signal yields the frequency-domain sequence:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

Using the classical noise spectrum estimation method, the MCRA (Minima Controlled Recursive Averaging) method, the estimated M-point noise amplitude spectrum sequence is:

$N_{p,t} = [N_{p,t,1}\ N_{p,t,2}\ \dots\ N_{p,t,M}]^T$
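As an illustration of step (1), the sketch below frames a multichannel capture, applies the M-point FFT, and estimates a noise amplitude spectrum sequence $N_{p,t}$. The minimum-tracking recursion (and the `alpha`, `beta` smoothing constants and function names) is a simplified stand-in for the full MCRA method, not the MCRA algorithm itself:

```python
import numpy as np

def frame_fft(x, L, M):
    """Split x of shape (P, num_samples) into frames of length L and
    apply an M-point FFT; returns X of shape (P, T, M)."""
    P, n = x.shape
    T = n // L
    frames = x[:, :T * L].reshape(P, T, L)
    return np.fft.fft(frames, n=M, axis=-1)

def noise_magnitude_spectrum(X, alpha=0.95, beta=0.998):
    """Simplified minima-controlled recursive averaging: smooth the power
    per bin, track its minimum as the noise floor, and return the noise
    amplitude spectrum N of shape (P, T, M)."""
    P, T, M = X.shape
    power = np.abs(X) ** 2
    smoothed = power[:, 0, :].copy()
    floor = smoothed.copy()
    N = np.empty((P, T, M))
    for t in range(T):
        smoothed = alpha * smoothed + (1 - alpha) * power[:, t, :]
        # let the tracked floor creep up slowly, then clamp it to the
        # current smoothed power so it follows the noise minima
        floor = np.minimum(floor / beta, smoothed)
        N[:, t, :] = np.sqrt(floor)
    return N
```

For example, `N = noise_magnitude_spectrum(frame_fft(x, L=512, M=512))` yields the sequence $N_{p,t}$ consumed in step (2); the frame and FFT lengths are illustrative.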
(2) Computing the array frequency-domain azimuth estimation factors.

From the signal spectrum sequence $X_{p,t}$ of array element p, the power spectrum of the signal acquired by array element p can be calculated as:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$.

From the noise amplitude spectrum sequence $N_{p,t}$ estimated for array element p, the noise signal power spectrum of array element p can be calculated as:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$.

According to the cross-correlation method, the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ between array element p and array element q at the m-th frequency point is computed from $P_{p,t,m}$, $P_{q,t,m}$, $Q_{p,t,m}$, $Q_{q,t,m}$ and a weighting factor δ, where δ is typically chosen as 0.25.
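A minimal sketch of step (2), assuming a particular noise-suppressed PHAT-style form for $PT_{p,q,t,m}$: spectral-subtraction gains built from Q multiply the usual $1/\sqrt{P_p P_q}$ normalization. This form, the function name and the `eps` floor are illustrative assumptions consistent with the stated goal of reducing the noise spectrum's contribution, not the patent's exact formula:

```python
import numpy as np

def azimuth_estimation_factor(X, N, delta=0.25, eps=1e-12):
    """ASSUMED form of PT[p, q, t, m]: spectral-subtraction gains for both
    elements on top of a PHAT normalization, so noise-dominated bins are
    down-weighted. X: spectra (P, T, M); N: noise amplitudes (P, T, M)."""
    P_sig = (X * np.conj(X)).real            # P[p, t, m] = X * conj(X)
    Q = N ** 2                               # Q[p, t, m] = N * N
    gain = np.maximum(P_sig - delta * Q, 0.0) / (P_sig + eps)
    Pn = X.shape[0]
    PT = np.empty((Pn, Pn) + X.shape[1:])
    for p in range(Pn):
        for q in range(Pn):
            # PHAT weight 1/sqrt(P_p * P_q), discounted by the noise gains
            PT[p, q] = gain[p] * gain[q] / (np.sqrt(P_sig[p] * P_sig[q]) + eps)
    return PT
```

With δ = 0 the gains equal 1 and the weight reduces to the plain SRP-PHAT normalization; larger δ discounts bins where the estimated noise power approaches the signal power.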
(3) Computing the accumulated power over the array time domain and frequency domain.

The cross-correlation spectrum between array element p and array element q is:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$.

From the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, the cross-correlation power of the frequency-domain azimuth estimation factors can be calculated as:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$.

Assuming the speaker azimuth angle is θ, and assuming the microphone array is circular with radius r from each array element to the center, the transmission delay from the speaker position to array element p is defined as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where $\theta_p$ is the azimuth of array element p and c is the speed of sound.

From the transmission delay $\tau_p$ from the speaker position to array element p, the phase of array element p at frequency point m is defined as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval, $f_s$ is the system sampling rate, and M is the number of FFT points.

If the microphone array has another shape, the phase of array element p at frequency point m is computed in the same way.

The cross-correlation phase between array element p and array element q at frequency point m is:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$
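A minimal sketch of the steering computation above for a circular array, using the delay convention $\tau_p = r\cos(\theta - \theta_p)/c$; the function name and argument layout are illustrative:

```python
import numpy as np

def pair_phases(theta, theta_elems, r, c, fs, M, m1, m2):
    """Cross-correlation phases H[p, q, m] for a candidate azimuth theta
    (radians), frequency bins m1..m2, and a circular array of radius r
    whose element azimuths are theta_elems (shape (P,))."""
    tau = r * np.cos(theta - theta_elems) / c            # tau_p per element
    m = np.arange(m1, m2 + 1)                            # selected bins
    f = fs / M                                           # bin spacing fs / M
    H = np.exp(-1j * 2 * np.pi * np.outer(tau, m) * f)   # H[p, m]
    return H[:, None, :] * np.conj(H[None, :, :])        # H_p * conj(H_q)
```

For the ternary array of fig. 1, `r = 0.04` and `theta_elems = np.deg2rad([90.0, 210.0, 330.0])` would be one consistent choice (the absolute orientation of the elements is an assumption).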
Assuming a sequence of frequency points from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power for frequency-domain azimuth estimation can be obtained as:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

The microphone array has P array elements, so the cross-correlation accumulated power of the frequency-domain and spatial-domain azimuth estimate is:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$

If the speaker azimuth is estimated using the signals of multiple time-domain data frames, the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimate is computed as:

$PTGFST(\theta) = \sum_{t=1}^{T} PTGFS_t(\theta)$
(4) Searching for the maximum of the cross-correlation accumulated power of the time-domain, frequency-domain and spatial-domain azimuth estimate.

Over the 0-to-360-degree circle, the speaker azimuth θ is varied in steps of the angular interval $\Delta_\theta$ and the cross-correlation accumulated power PTGFST(θ) is computed for each azimuth. Assuming the total number of angle search points is $N_\theta$, the cross-correlation accumulated power sequence is:

$PTGFST = [PTGFST(\Delta_\theta)\ PTGFST(2\Delta_\theta)\ \dots\ PTGFST(N_\theta \Delta_\theta)]^T$

Search for the maximum of the cross-correlation accumulated power sequence; the angle corresponding to the maximum is the speaker azimuth estimate:

$\hat{\theta} = \hat{n}\,\Delta_\theta, \qquad \hat{n} = \arg\max_{n} PTGFST(n\Delta_\theta)$
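Combining steps (3) and (4), the sketch below accumulates PTGFST(θ) over the selected frequency bins, all array element pairs with q > p, and all frames for each candidate angle, then returns the angle at the maximum. It reuses the `pair_phases` helper sketched above; taking the real part of the accumulated sum as the search score is an implementation assumption:

```python
import numpy as np

def estimate_azimuth(X, PT, theta_elems, r, c, fs, m1, m2, step_deg=1.0):
    """Grid search over azimuth. X: spectra (P, T, M); PT: azimuth
    estimation factors (P, P, T, M). Returns the estimated azimuth in
    degrees and the PTGFST sequence over the search grid."""
    Pn, T, M = X.shape
    # PTG[p, q, t, m] = PT * G, with G = X_p * conj(X_q), kept for bins m1..m2
    PTG = (PT * (X[:, None, :, :] * np.conj(X[None, :, :, :])))[..., m1:m2 + 1]
    iu, ju = np.triu_indices(Pn, k=1)                  # pairs with q > p
    angles = np.arange(step_deg, 360.0 + 1e-9, step_deg)
    ptgfst = np.empty(len(angles))
    for n, a in enumerate(angles):
        H = pair_phases(np.deg2rad(a), theta_elems, r, c, fs, M, m1, m2)
        acc = (PTG * H[:, :, None, :]).real            # per-bin terms
        ptgfst[n] = acc[iu, ju].sum()                  # over pairs, frames, bins
    return angles[np.argmax(ptgfst)], ptgfst
```

On the low-compute platforms mentioned above, a coarse search (for example in 5-degree steps) followed by a fine search around the peak reduces the cost of the angle loop.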
according to another embodiment of the present application, an apparatus for estimating a speaker using a microphone array is provided. Referring to fig. 7, the apparatus comprises a microphone array 10, a memory, a processor 1 and a computer program stored on the memory and executable on the processor. The processor 1 implements the method as described above when executing the program to process the speaker signals of the microphone array acquisition 10. Specifically, the processor 1 includes:
an array noise spectrum estimation module 11 which receives the acquisition signals sent from the microphone array 10, calculates an array noise spectrum using classical spectral subtraction, and sends the array noise spectrum data to an array frequency domain azimuth estimation factor generation module 12;
an array frequency domain orientation estimation factor generation module 12, configured to receive the noise amplitude spectrum sequence, and calculate a frequency domain orientation estimation factor between each array element of the microphone array 10 according to the noise amplitude spectrum sequence and the frequency domain sequence of the signal acquired by the microphone array 10;
a cross-correlation accumulated power generating module 13 for receiving the frequency domain azimuth estimation factors sent from the array frequency domain azimuth estimation factor generating module 12, calculating a cross-correlation accumulated power sequence in combination with the time domain, the frequency domain and the space domain, and sending the power sequence to an accumulated power sequence maximum value searching module 14;
a maximum value searching module 14, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generating module, search for a maximum value thereof, record the number of angle interval changes corresponding to the maximum value, and send the number of angle interval changes to the speaker azimuth estimating module 15;
and the speaker azimuth estimating module 15 is used for receiving the angle interval output by the maximum value searching module 14 and calculating a speaker azimuth estimated value.
The microphone array 10 collects speaker signals and sends them to the array noise spectrum estimation module 11.
According to yet another embodiment of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method described above.
In scenario 1 shown in fig. 2 and scenario 2 shown in fig. 3, the speaker azimuth is estimated using the method of the above embodiment; the resulting speaker azimuth curves are shown in fig. 8 and fig. 9, respectively. As can be seen from fig. 8 and fig. 9, even when a noise source is present, the method of the embodiment can accurately estimate the speaker's azimuth.
The above-described embodiments are provided to illustrate the technical concept and features of the present application and to enable those skilled in the art to understand and implement it; they do not limit the scope of protection of the present application. All equivalent changes or modifications made according to the spirit of the present application shall fall within the scope of the present application.

Claims (10)

1. A method for estimating a speaker's azimuth with a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on signals acquired by a microphone array to obtain a noise amplitude spectrum sequence;
s2, calculating frequency domain azimuth estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals acquired by the microphone array;
s3, calculating a cross-correlation accumulated power sequence of frequency domain and airspace azimuth estimation of the microphone array according to the frequency domain azimuth estimation factor;
and S4, searching the maximum value of the cross-correlation accumulated power sequence, and taking the angle corresponding to the maximum value as the azimuth estimation value of the speaker.
2. The method of claim 1, wherein the microphone array comprises P array elements, the microphone array acquires signals in frames of length L, and an M-point fast Fourier transform is performed on each frame of length L, where M represents the number of points of the fast Fourier transform;

a frame signal $x_{p,t}$ collected by array element p is defined as:

$x_{p,t} = [x_{p,t,1}\ x_{p,t,2}\ \dots\ x_{p,t,L}]^T$

where p represents the array element index, p = 1, 2, …, P; t represents the frame index, t = 1, 2, …, T, T being the total number of frames; and $x_{p,t,l}$ represents the signal time-domain sample value of the p-th array element, t-th frame and l-th sampling instant;

the frame signal is subjected to the M-point fast Fourier transform to obtain the signal spectrum sequence $X_{p,t}$:

$X_{p,t} = [X_{p,t,1}\ X_{p,t,2}\ \dots\ X_{p,t,M}]^T$

where $X_{p,t,m}$ represents the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
3. The method according to claim 2, wherein step S2 specifically comprises:

computing, from the signal spectrum sequence $X_{p,t}$, the power spectrum $P_{p,t}$ of the signal acquired by array element p:

$P_{p,t} = [P_{p,t,1}\ P_{p,t,2}\ \dots\ P_{p,t,M}]^T$

where $P_{p,t,m} = X_{p,t,m} X^{*}_{p,t,m}$ represents the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, $X^{*}_{p,t,m}$ represents the signal spectrum conjugate value of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M;

computing, from the noise amplitude spectrum sequence $N_{p,t}$ of array element p, the noise signal power spectrum $Q_{p,t}$ of array element p:

$Q_{p,t} = [Q_{p,t,1}\ Q_{p,t,2}\ \dots\ Q_{p,t,M}]^T$

where $Q_{p,t,m} = N_{p,t,m} \times N_{p,t,m}$ represents the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and $N_{p,t,m}$ represents the noise spectrum value of the p-th array element, t-th frame and m-th frequency sampling point; and

computing the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$ of the m-th frequency point between array element p and array element q from the signal power spectrum values $P_{p,t,m}$ and $P_{q,t,m}$, the noise power spectrum values $Q_{p,t,m}$ and $Q_{q,t,m}$, and a weighting factor δ.
4. A method according to claim 2 or 3, wherein step S3 comprises:

letting the cross-correlation spectrum $G_{p,q,t}$ between array element p and array element q be:

$G_{p,q,t} = [G_{p,q,t,1}\ G_{p,q,t,2}\ \dots\ G_{p,q,t,M}]^T$

where $G_{p,q,t,m} = X_{p,t,m} X^{*}_{q,t,m}$, and $X^{*}_{q,t,m}$ represents the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point;

computing, from the frequency-domain azimuth estimation factor $PT_{p,q,t,m}$, the cross-correlation power $PTG_{p,q,t}$ of the frequency-domain azimuth estimation factors:

$PTG_{p,q,t} = [PTG_{p,q,t,1}\ PTG_{p,q,t,2}\ \dots\ PTG_{p,q,t,M}]^T$

where $PTG_{p,q,t,m} = PT_{p,q,t,m} \times G_{p,q,t,m}$;

assuming the azimuth angle of the speaker is θ, defining the transmission delay from the speaker position to array element p as:

$\tau_p = \frac{r\cos(\theta - \theta_p)}{c}$

where r is the distance from array element p to the geometric center of the microphone array, $\theta_p$ is the azimuth angle of array element p, and c is the speed of sound;

defining, from the transmission delay $\tau_p$ from the speaker position to array element p, the phase $H_{p,m}$ of array element p at frequency point m as:

$H_{p,m} = \exp(-j 2\pi m f \tau_p)$

where $f = f_s / M$ is the frequency-domain sampling interval and $f_s$ is the system sampling rate;

the cross-correlation phase $H_{p,q,m}$ between array element p and array element q at frequency point m being:

$H_{p,q,m} = H_{p,m} H^{*}_{q,m}$

computing, for a sequence of frequency points from M1 to M2 selected from the M frequency points, the cross-correlation accumulated power $PTGF_{p,q,t}$ of the frequency-domain azimuth estimate:

$PTGF_{p,q,t} = \sum_{m=M1}^{M2} PTG_{p,q,t,m}\, H_{p,q,m}$

the cross-correlation accumulated power $PTGFS_t$ of the frequency-domain and spatial-domain azimuth estimate being:

$PTGFS_t = \sum_{p=1}^{P-1} \sum_{q=p+1}^{P} PTGF_{p,q,t}$
5. The method according to claim 4, wherein in step S3 the speaker azimuth is estimated using the signals of multiple time-domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the time-domain, frequency-domain and spatial-domain azimuth estimate is calculated as:

$PTGFST(\theta) = \sum_{t=1}^{T} PTGFS_t(\theta)$
6. The method according to claim 5, wherein in step S4 the speaker azimuth θ is varied within the 0-to-360-degree circle in steps of the angular interval $\Delta_\theta$, and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is $N_\theta$, the cross-correlation accumulated power sequence PTGFST is:

$PTGFST = [PTGFST(\Delta_\theta)\ PTGFST(2\Delta_\theta)\ \dots\ PTGFST(N_\theta \Delta_\theta)]^T$

the maximum value of the cross-correlation accumulated power sequence is searched, and the angle corresponding to the maximum value is the azimuth estimation value of the speaker:

$\hat{\theta} = \hat{n}\,\Delta_\theta, \qquad \hat{n} = \arg\max_{n} PTGFST(n\Delta_\theta)$
7. The method of claim 5, wherein the microphone array is a circular array and r is the distance from the array element p to the center of the microphone array.
8. An apparatus for estimating a speaker's azimuth using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program to process speaker signals acquired by the microphone array, performs the method of any of claims 1-7.
9. The apparatus of claim 8, wherein the processor comprises:

an array noise spectrum estimation module, configured to receive the signals acquired by the microphone array and calculate the noise amplitude spectrum sequence;

an array frequency-domain azimuth estimation factor generation module, configured to receive the noise amplitude spectrum sequence and calculate the frequency-domain azimuth estimation factors between the array elements of the microphone array from the noise amplitude spectrum sequence and the frequency-domain sequence of the signals acquired by the microphone array;

a cross-correlation accumulated power generation module, configured to receive the frequency-domain azimuth estimation factors output by the array frequency-domain azimuth estimation factor generation module and calculate the cross-correlation accumulated power sequence over the time, frequency and spatial domains;

a maximum value search module, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module, search for its maximum value, and record the angle interval corresponding to the maximum value; and

a speaker azimuth angle estimation module, configured to receive the angle interval output by the maximum value search module and calculate the speaker azimuth angle estimate.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 7.
CN202110664316.0A 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array Active CN113470682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664316.0A CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array


Publications (2)

Publication Number Publication Date
CN113470682A CN113470682A (en) 2021-10-01
CN113470682B (en) 2023-11-24

Family

ID=77869967

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110664316.0A Active CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array

Country Status (1)

Country Link
CN (1) CN113470682B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010091077A1 (en) * 2009-02-03 2010-08-12 University Of Ottawa Method and system for a multi-microphone noise reduction

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006194700A (en) * 2005-01-12 2006-07-27 Hiroshima Industrial Promotion Organization Sound source direction estimation system, sound source direction estimation method and sound source direction estimation program
JP2007006253A (en) * 2005-06-24 2007-01-11 Sony Corp Signal processor, microphone system, and method and program for detecting speaker direction
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
CN104142492A (en) * 2014-07-29 2014-11-12 佛山科学技术学院 SRP-PHAT multi-source spatial positioning method
CN106199607A (en) * 2016-06-29 2016-12-07 北京捷通华声科技股份有限公司 The Sounnd source direction localization method of a kind of microphone array and device
CN109188362A (en) * 2018-09-03 2019-01-11 中国科学院声学研究所 A kind of microphone array auditory localization signal processing method
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment
CN110488223A (en) * 2019-07-05 2019-11-22 东北电力大学 A kind of sound localization method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jongho Han, "Tracking of a moving object by following the sound source," 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), full text. *
Zhang Kun, "Research on sound source localization algorithms for a microphone array based on an embedded system," China Master's Theses Full-text Database, full text. *

Also Published As

Publication number Publication date
CN113470682A (en) 2021-10-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant