CN113470682A - Method, device and storage medium for estimating speaker orientation by microphone array - Google Patents

Method, device and storage medium for estimating speaker orientation by microphone array

Info

Publication number: CN113470682A
Application number: CN202110664316.0A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN113470682B (granted)
Prior art keywords: array, speaker, sequence, array element, spectrum
Legal status: Granted; Active
Inventors: 马登永, 蔡野锋, 沐永生, 叶超
Assignee (current and original): Zhongke Shangsheng Suzhou Electronics Co ltd
Application filed by Zhongke Shangsheng Suzhou Electronics Co ltd
Priority: CN202110664316.0A
Publications: CN113470682A (application), CN113470682B (grant)

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Processing in the frequency domain
    • G10L21/0272 — Voice signal separating
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

The invention discloses a method, a device and a storage medium for estimating the orientation of a speaker with a microphone array. The method comprises the following steps: S1, performing noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence; S2, calculating frequency-domain orientation estimation factors between array elements of the microphone array from the noise magnitude spectrum sequence and the frequency-domain sequence of the collected signals; S3, calculating the cross-correlation accumulated power sequence of the array's frequency-domain and spatial-domain orientation estimate from the frequency-domain orientation estimation factors; and S4, searching for the maximum of the cross-correlation accumulated power sequence and taking the corresponding angle as the estimate of the speaker's direction. The invention can accurately estimate the speaker's orientation in a noisy environment.

Description

Method, device and storage medium for estimating speaker orientation by microphone array
Technical Field
The invention belongs to the field of speech processing, in particular to microphone-array sound pickup, and relates to a method, a device and a storage medium for estimating the direction of a speaker with a microphone array.
Background
With the continuing breakthroughs and development of artificial intelligence, great progress has been made in the speech field: branches such as echo cancellation, speech enhancement, speech recognition, voiceprint recognition, semantic analysis and semantic understanding are all developing rapidly, and all face a common problem, namely how to pick up high-quality speech signals. At present, microphone arrays are used to pick up, with high quality, the speech signal from the specific direction in which a speaker is located, so estimating the speaker's direction has become a key part of high-quality array pickup. This orientation estimate is affected by both ambient reverberation and ambient noise, and its precision usually suffers large errors.
At present, the most classical method for estimating the speaker's orientation with a microphone array is SRP-PHAT (Steered-Response Power with Phase Transform). SRP-PHAT suppresses reverberation well, but as soon as even slight ambient noise appears it tends to estimate the direction of the noise instead; the method cannot suppress ambient noise. For example, the microphone-array direction-of-arrival estimation based on SRP-PHAT proposed in Chinese patent application 201410366922.4 estimates the direction of a noise source rather than that of the speaker when a noise source is present, so the method is deficient in noisy environments.
Fig. 1 shows the array element positions of a three-element microphone array: three microphones spaced 120 degrees apart, uniformly distributed on a circle of radius 4 cm. As shown in figs. 2 and 3, speaker localization with this array was performed using the SRP-PHAT method in two scenarios. Scenario 1 (fig. 2): the speaker stands 3 meters from the array and speaks in turn from the 45-, 90- and 135-degree directions, with no noise source turned on. Scenario 2 (fig. 3): the speaker again speaks in turn from the 45-, 90- and 135-degree directions at 3 meters, while a noise source 3 meters away in the 200-degree direction plays white noise at a signal-to-noise ratio of 5 dB. As shown in fig. 4, in scenario 1, without a noise source, SRP-PHAT can estimate the speaker's orientation. As shown in fig. 5, in the presence of the noise source, SRP-PHAT cannot estimate the speaker's orientation and mostly estimates the orientation of the noise source.
Comparing fig. 4 and fig. 5 shows that SRP-PHAT effectively estimates the speaker's orientation in a reverberant environment, but in the presence of a noise source it mostly estimates the orientation of the noise source instead.
In summary: (1) Existing speaker orientation estimation methods such as SRP (Steered-Response Power) may fail to estimate the speaker's position in reverberant scenes. To improve robustness to reverberation, many researchers adopted the classical SRP-PHAT (Steered-Response Power with Phase Transform) method for speaker localization, which achieves fairly good results in reverberant environments. (2) Existing methods often cannot accurately estimate the speaker's orientation in scenes containing a noise source; in particular, when a noise source is active, the orientation estimated by SRP-PHAT turns out to be that of the noise source, so the speaker cannot be located. SRP-PHAT thus fails at speaker localization in environments where noise sources are present.
To address the inapplicability of SRP-PHAT in noisy environments, the present disclosure provides a method and a device for speaker orientation estimation with a microphone array, so that the array can accurately estimate the speaker's orientation in both reverberant and noisy environments, achieving the goal of high-quality pickup of the speaker's voice.
Disclosure of Invention
The invention aims to provide a method, a device and a storage medium for estimating the orientation of a speaker with a microphone array, capable of accurately estimating the speaker's orientation in a noisy environment.
According to a first aspect of the present invention, there is provided a method of estimating a speaker's orientation using a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence;
s2, calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise magnitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
s3, calculating a cross-correlation accumulation power sequence of the frequency domain and the space domain orientation estimation of the microphone array according to the frequency domain orientation estimation factor;
and S4, searching the maximum value of the cross-correlation accumulation power sequence, and taking the angle corresponding to the maximum value as the direction estimation value of the speaker.
In some embodiments, the microphone array contains P array elements, each frame of signal collected by the array has frame length L, and each length-L frame undergoes an M-point fast Fourier transform (FFT), where M is the number of FFT points;
a frame signal x_{p,t} collected by array element p is defined as:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
where p is the array element index, p = 1, 2, …, P; t is the frame index, t = 1, 2, …, T, with T the total number of frames; and x_{p,t,l} is the time-domain sample of the p-th element, t-th frame, at the l-th sampling instant;
after the M-point FFT of the frame, the signal spectrum sequence X_{p,t} is obtained as:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
where X_{p,t,m} is the signal spectrum sample of the p-th element, t-th frame, at the m-th frequency sampling point.
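The framing and transform step described above can be sketched as follows; the channel count, frame length, and FFT size are illustrative values, and white noise stands in for a real multichannel recording:

```python
import numpy as np

fs = 16000   # sampling rate (Hz)
L = 512      # frame length L
M = 512      # M-point FFT per frame
P = 3        # number of array elements

# placeholder multichannel recording: P channels, one second of white noise
rng = np.random.default_rng(1)
x = rng.standard_normal((P, fs))

T = x.shape[1] // L                       # total number of frames T
frames = x[:, :T * L].reshape(P, T, L)    # x_{p,t}: one L-sample frame per element
X = np.fft.fft(frames, n=M, axis=-1)      # X_{p,t}: M-point spectrum per frame
# X.shape == (P, T, M), i.e. (3, 31, 512)
```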
In some embodiments, step S2 specifically includes:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal collected by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
where P_{p,t,m} = X_{p,t,m} × X*_{p,t,m} is the signal power spectrum value of the p-th element, t-th frame, m-th frequency sampling point, X*_{p,t,m} denotes the conjugate of the signal spectrum sample, and m = 1, 2, …, M;
according to the noise magnitude spectrum sequence N_{p,t} of array element p, calculating the noise signal power spectrum Q_{p,t} of element p as:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
where Q_{p,t,m} = N_{p,t,m} × N_{p,t,m} is the noise power spectrum value and N_{p,t,m} the noise magnitude spectrum value of the p-th element, t-th frame, m-th frequency sampling point;
then calculating the frequency-domain orientation estimation factor PT_{p,q,t,m} between array elements p and q at the m-th frequency point, a function of the signal power spectrum values P_{p,t,m} and P_{q,t,m}, the noise power spectrum values Q_{p,t,m} and Q_{q,t,m}, and a weighting factor δ; its defining equation appears only as an image in this text extraction.
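The patent's exact formula for PT_{p,q,t,m} is only an equation image in this extraction, so the sketch below uses an explicitly assumed PHAT-style form in which the whitening denominator is inflated by the δ-weighted noise power, which matches the stated intent of reducing the noise spectrum's contribution; treat the `PT` expression as an assumption, not the patented formula:

```python
import numpy as np

M, delta = 512, 0.25     # FFT size; weighting factor delta (0.25 per the text)
rng = np.random.default_rng(2)
Xp = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # spectrum, element p
Xq = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # spectrum, element q
Np = 0.1 * np.abs(rng.standard_normal(M))                   # noise magnitude spectra
Nq = 0.1 * np.abs(rng.standard_normal(M))

Pp = (Xp * np.conj(Xp)).real    # P_{p,t,m} = X_{p,t,m} * conj(X_{p,t,m})
Pq = (Xq * np.conj(Xq)).real
Qp = Np * Np                    # Q_{p,t,m} = N_{p,t,m}^2
Qq = Nq * Nq

# ASSUMED form: PHAT-style whitening whose denominator grows with the
# delta-weighted noise power, so noise-dominated bins contribute less
PT = 1.0 / (np.sqrt(Pp * Pq) + delta * np.sqrt(Qp * Qq) + 1e-12)
```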
In some embodiments, step S3 specifically includes:
the cross-correlation spectrum G_{p,q,t} between array elements p and q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
where G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, and X*_{q,t,m} is the conjugate signal spectrum sample of the q-th element, t-th frame, m-th frequency sampling point;
from the frequency-domain orientation estimation factor PT_{p,q,t,m}, the cross-correlation power PTG_{p,q,t} is calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
where PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the speaker is at azimuth θ, the transmission delay from the speaker's position to array element p is defined as:
τ_p = r·cos(θ − θ_p)/c
where r is the distance from element p to the geometric center of the microphone array, θ_p is the azimuth of element p, and c is the speed of sound;
from the transmission delay τ_p, the phase H_{p,m} of element p at frequency point m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where f = fs/M is the frequency-domain sampling interval and fs is the system sampling rate;
the cross-correlation phase H_{p,q,m} between elements p and q at frequency point m is:
H_{p,q,m} = H*_{p,m} × H_{q,m}
selecting a frequency-point range from M1 to M2 out of the M frequency points, the cross-correlation accumulated power PTGF_{p,q,t} of the frequency-domain orientation estimate is obtained as:
PTGF_{p,q,t} = Σ_{m=M1..M2} PTG_{p,q,t,m} × H_{p,q,m}
and the cross-correlation accumulated power PTGFS_t of the joint frequency-domain and spatial-domain orientation estimate is:
PTGFS_t = Σ_{p=1..P−1} Σ_{q=p+1..P} PTGF_{p,q,t}
In some embodiments, in step S3, the speaker orientation is estimated using multiple time-domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the joint time-, frequency- and spatial-domain estimate is:
PTGFST(θ) = Σ_{t=1..T} PTGFS_t
In some embodiments, in step S4, the speaker azimuth θ is stepped over a full circle of 0 to 360 degrees at angular interval Δθ, computing the cross-correlation accumulated power PTGFST(θ) at each azimuth. With N_θ total angle search points, the accumulated power sequence is:
PTGFST = [PTGFST(Δθ) PTGFST(2Δθ) … PTGFST(N_θΔθ)]^T
The maximum of the sequence is located, and the angle corresponding to the maximum is the estimate of the speaker's direction:
θ̂ = argmax_θ PTGFST(θ)
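The S4 angle search can be sketched independently of the power computation; the objective function below is a stand-in with a known single peak, not the patent's PTGFST, so only the grid-and-argmax mechanics are illustrated:

```python
import numpy as np

delta_theta = 1.0                    # angular interval (degrees)
N_theta = int(360 / delta_theta)     # total number of angle search points

def ptgfst(theta_deg):
    # stand-in objective peaking at 135 degrees; in the real method this
    # would be the cross-correlation accumulated power for azimuth theta
    return -abs(theta_deg - 135.0)

# evaluate the sequence [PTGFST(dtheta), PTGFST(2*dtheta), ..., PTGFST(N*dtheta)]
powers = np.array([ptgfst(k * delta_theta) for k in range(1, N_theta + 1)])
theta_hat = (np.argmax(powers) + 1) * delta_theta   # angle at the maximum
# theta_hat == 135.0
```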
in some embodiments, the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
According to a second aspect of the present invention, there is provided an apparatus for estimating a speaker's orientation using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to perform the method described above on the speaker signals collected by the microphone array.
In some embodiments, the processor comprises:
the array noise spectrum estimation module is used for receiving the signals collected by the microphone array and calculating a noise magnitude spectrum sequence;
the array frequency domain orientation estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
the cross-correlation accumulation power generation module is used for receiving the frequency domain orientation estimation factors output by the array frequency domain orientation estimation factor generation module and calculating a cross-correlation accumulation power sequence by combining a time domain, a frequency domain and a space domain;
the maximum value searching module is used for receiving the cross-correlation accumulation power sequence output by the cross-correlation accumulation power generating module, searching the maximum value of the cross-correlation accumulation power sequence and recording the angle interval corresponding to the maximum value; and
the speaker azimuth angle estimation module, used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimate.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described above.
Compared with the prior art, the scheme of the invention has the following advantages:
The method for estimating the speaker's orientation with a microphone array separates the noise spectrum from the speech spectrum, suppresses the weight of the noise spectrum in the orientation estimation factor, and reduces the noise spectrum's contribution to it, giving a marked anti-noise effect. The invention also has good anti-reverberation performance and can effectively estimate the speaker's orientation in a reverberant environment. The algorithm occupies few resources and is suitable for engineering implementation on low-compute platforms.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a diagram of array element positions of a ternary microphone array.
Fig. 2 is a plan view of the speaker and noise orientation.
FIG. 3 is a three-dimensional view of the speaker and noise orientation.
FIG. 4 shows the speaker orientation curve estimated by the SRP-PHAT method without noise sources.
FIG. 5 shows the speaker orientation curve estimated by the SRP-PHAT method in the presence of a noise source.
Fig. 6 shows a flow diagram of a method according to an embodiment of the invention.
Fig. 7 shows a block diagram of an apparatus according to an embodiment of the invention.
FIG. 8 shows the speaker orientation curve estimated under scenario 1 by the method of an embodiment of the present invention.
FIG. 9 shows the speaker orientation curve estimated in scenario 2 by the method of an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the invention may be more readily understood by those skilled in the art. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
According to an embodiment of the present invention, fig. 6 shows the signal-processing diagram of a method for estimating a speaker's orientation with a microphone array; the flow of the method is described in detail below with reference to fig. 6.
(1) Noise spectrum estimation.
Assume the microphone array has P array elements, each frame of signal collected by the array has length L, and an M-point FFT is applied to each length-L frame. A frame signal collected by array element p is defined as:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
where t denotes the t-th frame, t = 1, 2, …, T.
After the M-point FFT of the frame, the frequency-domain sequence is obtained as:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
Using the classical MCRA (Minima-Controlled Recursive Averaging) noise spectrum estimation method, the estimated M-point noise magnitude spectrum sequence is:
N_{p,t} = [N_{p,t,1} N_{p,t,2} … N_{p,t,M}]^T
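Full MCRA involves minimum tracking and speech-presence probabilities; the sketch below is a heavily simplified recursive-averaging stand-in (the smoothing constant and speech threshold are assumed values), shown only to convey the idea of updating the noise magnitude spectrum on noise-dominated frames while skipping speech-dominated ones:

```python
import numpy as np

alpha = 0.95                 # recursive smoothing constant (assumed value)
rng = np.random.default_rng(3)
M, T = 512, 50
# magnitude spectra |X_{p,t}| for one element: a low noise floor plus a
# burst of much louder "speech" frames
mag = 0.1 + 0.02 * rng.standard_normal((T, M)) ** 2
mag[20:25] += 2.0

N_est = mag[0].copy()
for t in range(1, T):
    # minima-controlled idea, heavily simplified: bins far above the current
    # noise estimate are treated as speech-dominated and excluded from the update
    speech = mag[t] > 3.0 * N_est
    N_est = np.where(speech, N_est, alpha * N_est + (1.0 - alpha) * mag[t])
```

The resulting `N_est` tracks the noise floor (around 0.1 here) rather than the speech burst.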
(2) Computing the array frequency-domain orientation estimation factors.
From the signal spectrum sequence X_{p,t} of array element p, the power spectrum of the collected signal of element p can be calculated as:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
where P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}.
From the estimated noise magnitude spectrum sequence N_{p,t} of element p, the noise signal power spectrum of element p can be calculated as:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
where Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}.
Following a cross-correlation approach, the frequency-domain orientation estimation factor at the m-th frequency point between array elements p and q is then computed from these signal and noise power spectra and a weighting factor δ, typically chosen as 0.25; its defining equation appears only as an image in this text extraction.
(3) Computing the array time-domain and frequency-domain accumulated power.
The cross-correlation spectrum between array elements p and q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
where G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}.
From the frequency-domain orientation estimation factor PT_{p,q,t,m}, the cross-correlation power of the factor can be calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
where PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m}.
Assume the speaker is at azimuth θ and the microphone array is circular with radius r from each element to the center of the circle. The transmission delay from the speaker's position to array element p is then defined as:
τ_p = r·cos(θ − θ_p)/c
where θ_p is the azimuth of element p and c is the speed of sound.
From the transmission delay τ_p, the phase of element p at frequency point m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where f = fs/M is the frequency-domain sampling interval, fs is the system sampling rate, and M is the number of FFT points.
If the microphone array has another shape, the phase of element p at frequency point m is computed the same way, with τ_p derived from that geometry.
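As a concrete sketch of the delay and phase computation for the circular geometry (the element azimuths, radius, sample rate, and the 90-degree candidate azimuth are illustrative values):

```python
import numpy as np

c, fs, M, r = 343.0, 16000, 512, 0.04        # speed of sound, sample rate, FFT size, radius
elem_az = np.deg2rad([0.0, 120.0, 240.0])    # theta_p: element azimuths (ternary array)
f = fs / M                                    # frequency-domain sampling interval f = fs/M

def tau(theta_deg):
    # transmission delay from a speaker at azimuth theta to each element,
    # relative to the geometric center of the circular array
    th = np.deg2rad(theta_deg)
    return r * np.cos(th - elem_az) / c

m = np.arange(M // 2)                                  # positive-frequency bins
H = np.exp(-2j * np.pi * np.outer(tau(90.0), m * f))   # H_{p,m} = exp(-j*2*pi*m*f*tau_p)
```

Each row of `H` is the unit-magnitude phase ramp of one element for the candidate azimuth.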
The cross-correlation phase at frequency point m between array elements p and q is:
H_{p,q,m} = H*_{p,m} × H_{q,m}
Selecting a frequency-point range from M1 to M2 out of the M frequency points for the cross-correlation accumulated power of the frequency-domain orientation estimate gives:
PTGF_{p,q,t} = Σ_{m=M1..M2} PTG_{p,q,t,m} × H_{p,q,m}
With P array elements, the cross-correlation accumulated power of the joint frequency-domain and spatial-domain orientation estimate is:
PTGFS_t = Σ_{p=1..P−1} Σ_{q=p+1..P} PTGF_{p,q,t}
if the speaker's orientation is estimated using the signals of multiple frames of data in the time domain, then the accumulated power of cross-correlation for the time, frequency, and spatial orientation estimates needs to be calculated as follows:
Figure BDA0003116245400000086
(4) and searching the maximum value of the power accumulated by the cross correlation of the time domain, the frequency domain and the space domain azimuth estimation.
Within a circle of 0 to 360 degrees, at angular intervals ΔθSequentially changing the azimuth theta of the speaker to calculate the cross-correlation accumulated power PTGFST (theta) corresponding to each azimuth, and assuming that the total point number of angle search is NθThe cross-correlation accumulated power sequence is as follows:
PTGFST=[PTGFST(Δθ)PTGFST(2Δθ)…PTGFST(NθΔθ)]T
searching the maximum value of the cross-correlation accumulation power sequence, and finding out the angle corresponding to the maximum value, namely the direction estimation value of the speaker, as follows:
Figure BDA0003116245400000091
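As an end-to-end sanity check of the steering and accumulation machinery, the toy below simulates a far-field speaker as pure inter-element delays and recovers the azimuth by grid search; a plain PHAT weight stands in for the patent's noise-weighted PT factor, and the geometry and the 72-degree source are arbitrary choices:

```python
import numpy as np

c, fs, M, r = 343.0, 16000, 512, 0.04
elem_az = np.deg2rad([0.0, 120.0, 240.0])    # ternary circular array
P = len(elem_az)
f = fs / M
m = np.arange(M // 2)

def steer(theta):
    # per-element phase terms exp(-j*2*pi*m*f*tau_p) for azimuth theta (radians)
    tau = r * np.cos(theta - elem_az) / c
    return np.exp(-2j * np.pi * np.outer(tau, m * f))

# simulate a speaker at 72 degrees as pure delays of one flat random spectrum
rng = np.random.default_rng(0)
S = rng.standard_normal(M // 2) + 1j * rng.standard_normal(M // 2)
X = S * steer(np.deg2rad(72.0))              # (P, M/2) element spectra

def srp(X, grid):
    powers = []
    for theta in grid:
        H = steer(theta)
        acc = 0.0
        for p in range(P):
            for q in range(p + 1, P):
                G = X[p] * np.conj(X[q])     # cross-correlation spectrum
                W = 1.0 / (np.abs(G) + 1e-12)   # plain PHAT weight (stand-in for PT)
                A = H[p] * np.conj(H[q])     # candidate cross-correlation phase
                # aligned phases add coherently at the true azimuth
                acc += np.sum(np.real(W * G * np.conj(A)))
        powers.append(acc)
    return np.array(powers)

grid = np.deg2rad(np.arange(0.0, 360.0, 1.0))
theta_hat = np.rad2deg(grid[np.argmax(srp(X, grid))])   # close to 72 degrees
```

Because the simulation matches the steering model exactly, the accumulated power peaks at the true azimuth; with real reverberant and noisy signals the peak broadens, which is where the noise-weighted factor is meant to help.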
according to another embodiment of the present invention, an apparatus for estimating a speaker using a microphone array is provided. Referring to fig. 7, the apparatus comprises a microphone array 10, a memory, a processor 1 and a computer program stored on the memory and executable on the processor. The processor 1 executes the program to perform the method as described above when processing the speaker signals of the microphone array acquisition 10. Specifically, the processor 1 includes:
an array noise spectrum estimation module 11, which receives the collected signals from the microphone array 10, calculates an array noise spectrum by using classical spectral subtraction, and sends the array noise spectrum data to an array frequency domain orientation estimation factor generation module 12;
an array frequency domain orientation estimation factor generation module 12, configured to receive the noise magnitude spectrum sequence, and calculate a frequency domain orientation estimation factor between each array element of the microphone array 10 according to the noise magnitude spectrum sequence and the frequency domain sequence of the signal acquired by the microphone array 10;
a cross-correlation accumulation power generation module 13, configured to receive the frequency domain orientation estimation factor sent by the array frequency domain orientation estimation factor generation module 12, calculate a cross-correlation accumulation power sequence by combining the time domain, the frequency domain, and the space domain, and send the power sequence to an accumulation power sequence maximum value search module 14;
a maximum value searching module 14, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module 13, search for the maximum value of the sequence, record the angle interval index corresponding to the maximum, and send it to the speaker azimuth angle estimation module 15;
and the speaker azimuth angle estimation module 15 is used for receiving the angle interval output by the maximum value search module 14 and calculating a speaker azimuth angle estimation value.
The microphone array 10 collects speaker signals and sends the speaker signals to the array noise spectrum estimation module 11.
According to yet another embodiment of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method described above.
In scenario 1 shown in fig. 2 and scenario 2 shown in fig. 3, respectively, the speaker orientation is estimated by using the method of the above embodiment, and the resulting speaker orientation curves are shown in fig. 8 and fig. 9, respectively. As can be seen from fig. 8 and 9, the method of the embodiment can estimate the orientation of the speaker more accurately in the presence of noise sources.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and are preferred embodiments, which are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (10)

1. A method for estimating a speaker's orientation with a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence;
s2, calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise magnitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
s3, calculating a cross-correlation accumulation power sequence of the frequency domain and the space domain orientation estimation of the microphone array according to the frequency domain orientation estimation factor;
and S4, searching the maximum value of the cross-correlation accumulation power sequence, and taking the angle corresponding to the maximum value as the direction estimation value of the speaker.
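Taken together, steps S1 to S4 amount to a noise-weighted steered-response-power search over candidate azimuths, in the family of SRP-PHAT methods. The sketch below is only an illustration under stated assumptions: the minimum-statistics noise estimate, the form of the noise weighting, and the sign conventions are guesses, since the corresponding equations appear only as images in this text; it is not the patented implementation.

```python
import numpy as np

def estimate_speaker_azimuth(frames, mic_angles, r, fs, c=343.0, n_angles=360):
    """Sketch of S1-S4 for a circular array: noise-weighted SRP-PHAT.

    frames: (P, T, L) time-domain frames per array element.
    mic_angles: (P,) element azimuths theta_p in radians; r: array radius (m).
    """
    P, T, L = frames.shape
    X = np.fft.rfft(frames, axis=-1)                      # spectra X_{p,t,m}
    # S1 (assumed): crude minimum-statistics noise magnitude estimate per element/bin
    noise_mag = np.min(np.abs(X), axis=1, keepdims=True)
    # S2 (assumed weight form): attenuate bins dominated by the noise estimate
    power = np.abs(X) ** 2
    weight = np.maximum(power - noise_mag ** 2, 0.0) / (power + 1e-12)
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)
    thetas = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    srp = np.zeros(n_angles)
    for p in range(P):
        for q in range(p + 1, P):
            # S3: weighted, PHAT-normalised cross-spectrum, accumulated over frames
            G = X[p] * np.conj(X[q]) * weight[p] * weight[q]
            G = G / (np.abs(G) + 1e-12)
            Gsum = G.sum(axis=0)
            # candidate inter-element delay for each azimuth (far-field model)
            tau = (r / c) * (np.cos(thetas[:, None] - mic_angles[p])
                             - np.cos(thetas[:, None] - mic_angles[q]))
            srp += np.real(np.exp(2j * np.pi * freqs * tau) @ np.conj(Gsum))
    # S4: the angle with maximum accumulated power is the azimuth estimate
    return np.degrees(thetas[np.argmax(srp)])
```

For a uniform circular array, `mic_angles` holds the element azimuths and `r` the array radius, matching the geometry assumed in the later claims.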
2. The method of claim 1, wherein the microphone array comprises P array elements, the microphone array collects frame signals with frame length L, and each frame signal of length L is subjected to an M-point fast Fourier transform operation, wherein M represents the number of points of the fast Fourier transform;
a frame signal x_{p,t} collected by array element p is defined as follows:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
wherein p represents the array element index, p = 1, 2, …, P; t represents the frame index, t = 1, 2, …, T, where T is the total number of frames; x_{p,t,l} represents the time domain sample value of the p-th array element, t-th frame and l-th sampling moment;
after the frame signal is subjected to the M-point fast Fourier transform operation, the signal spectrum sequence X_{p,t} is obtained as follows:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
wherein X_{p,t,m} represents the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
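Claim 2's indexing maps directly onto array axes. A minimal numpy sketch of the framing and FFT step (the dimensions here are illustrative, not taken from the patent):

```python
import numpy as np

# x[p, t, l]: time-domain sample of element p, frame t, sample l (claim 2's x_{p,t,l});
# X[p, t, m]: its M-point FFT (claim 2's X_{p,t,m}). Dimensions are illustrative.
P, T, L = 4, 10, 256
M = L                                   # assumption: FFT length equals frame length
rng = np.random.default_rng(1)
x = rng.standard_normal((P, T, L))      # one frame per (element, frame index)
X = np.fft.fft(x, n=M, axis=-1)         # spectrum sequences X_{p,t}
```

Each `X[p, t]` is then one spectrum sequence X_{p,t} of length M, as in the claim.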
3. The method according to claim 2, wherein step S2 specifically includes:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal collected by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
wherein P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}; P_{p,t,m} represents the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, X*_{p,t,m} represents the conjugate value of the signal spectrum of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M;
according to the noise magnitude spectrum sequence N_{p,t} of array element p, calculating the noise signal power spectrum Q_{p,t} of array element p as follows:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
wherein Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}; Q_{p,t,m} represents the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and N_{p,t,m} represents the noise magnitude spectrum value of the p-th array element, t-th frame and m-th frequency sampling point;
calculating the frequency domain orientation estimation factor PT_{p,q,t,m} between array element p and array element q at the m-th frequency point according to the following formula:
[equation image FDA0003116245390000021: definition of PT_{p,q,t,m} in terms of P_{p,t,m}, P_{q,t,m}, Q_{p,t,m}, Q_{q,t,m} and δ; not recoverable from the text]
wherein P_{q,t,m} represents the signal power spectrum value of the q-th array element, t-th frame and m-th frequency sampling point, Q_{p,t,m} and Q_{q,t,m} represent the corresponding noise power spectrum values, and δ is a weighting factor.
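The exact expression for PT_{p,q,t,m} exists only as an equation image in this text and cannot be recovered. The function below is therefore only a hypothetical stand-in built from the same inputs (signal powers P, noise powers Q, weighting factor δ), shown to make the role of the factor concrete; it is not the patented formula.

```python
import numpy as np

def orientation_weight(Pp, Pq, Qp, Qq, delta=1e-3):
    """Hypothetical stand-in for PT_{p,q,t,m}: emphasize frequency bins whose
    signal power exceeds the noise power estimate on both elements. delta plays
    the role of the claim's weighting factor. Not the patented formula."""
    wp = np.maximum(Pp - Qp, 0.0) / (Pp + delta)   # per-element SNR-style weight
    wq = np.maximum(Pq - Qq, 0.0) / (Pq + delta)
    return wp * wq                                 # joint weight for the (p, q) pair
```

A weight of this shape goes to zero on noise-dominated bins and toward one on clean bins, which is the qualitative behavior the claim's factor needs.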
4. The method according to claim 2 or 3, wherein step S3 specifically comprises:
assuming the cross-correlation spectrum G_{p,q,t} between array elements p and q is as follows:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
wherein G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, and X*_{q,t,m} represents the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point;
according to the frequency domain orientation estimation factor PT_{p,q,t,m}, calculating the cross-correlation power PTG_{p,q,t} of the frequency domain orientation estimation factor as follows:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
wherein PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the speaker azimuth angle is θ, the transmission delay τ_p from the speaker position to array element p is defined as:
τ_p = r cos(θ − θ_p) / c
wherein r is the distance from array element p to the geometric center of the microphone array, θ_p is the azimuth angle of array element p, and c is the sound velocity;
according to the transmission delay τ_p from the speaker position to array element p, defining the phase H_{p,m} of array element p at frequency point m as follows:
H_{p,m} = exp(−j2πmfτ_p)
wherein the frequency domain sampling interval f = fs / M, and fs is the system sampling rate;
the cross-correlation phase H_{p,q,m} between array elements p and q at frequency point m is as follows:
H_{p,q,m} = H_{p,m} × H*_{q,m}
assuming a frequency point sequence from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power PTGF_{p,q,t} for frequency domain orientation estimation is calculated as:
PTGF_{p,q,t} = Σ_{m=M1}^{M2} PTG_{p,q,t,m} × H_{p,q,m}
the cross-correlation accumulated power PTGFS_t for frequency domain and spatial domain orientation estimation is as follows:
PTGFS_t = Σ_{p=1}^{P−1} Σ_{q=p+1}^{P} PTGF_{p,q,t}
5. The method according to claim 4, wherein in step S3, the speaker's orientation is estimated using the signals of multiple time domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the time domain, frequency domain and spatial domain orientation estimates is calculated as follows:
PTGFST(θ) = Σ_{t=1}^{T} PTGFS_t
6. The method according to claim 5, wherein in step S4, within the full circle of 0 to 360 degrees, the speaker azimuth θ is varied sequentially at an angular interval of Δ_θ degrees, and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is N_θ, the cross-correlation accumulated power sequence PTGFST is as follows:
PTGFST = [PTGFST(Δ_θ) PTGFST(2Δ_θ) … PTGFST(N_θΔ_θ)]^T
searching the maximum value of the cross-correlation accumulated power sequence, the angle corresponding to the maximum value is the speaker orientation estimate:
θ̂ = Δ_θ × argmax_{n=1,…,N_θ} PTGFST(nΔ_θ)
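Claim 6's grid search is a one-liner in practice. In the sketch below, `ptgfst` stands for any callable implementing claim 5's accumulated power PTGFST(θ); it is a placeholder, not the patented module.

```python
import numpy as np

def search_azimuth(ptgfst, delta_theta=1.0):
    """Scan theta over the full circle in steps of delta_theta degrees and
    return the angle whose accumulated power is maximal (claim 6's S4)."""
    angles = np.arange(delta_theta, 360.0 + delta_theta, delta_theta)  # delta..N*delta
    powers = np.array([ptgfst(a) for a in angles])
    return angles[np.argmax(powers)]
```

The angular step Δ_θ trades resolution against the cost of evaluating PTGFST once per grid point.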
7. the method of claim 5, wherein the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
8. An apparatus for estimating a speaker's orientation using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the method of any one of claims 1 to 7 on the speaker signals collected by the microphone array.
9. The apparatus of claim 8, wherein the processor comprises:
the array noise spectrum estimation module is used for receiving the signals collected by the microphone array and calculating a noise magnitude spectrum sequence;
the array frequency domain orientation estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
the cross-correlation accumulation power generation module is used for receiving the frequency domain orientation estimation factors output by the array frequency domain orientation estimation factor generation module and calculating a cross-correlation accumulation power sequence by combining a time domain, a frequency domain and a space domain;
the maximum value searching module is used for receiving the cross-correlation accumulation power sequence output by the cross-correlation accumulation power generating module, searching the maximum value of the cross-correlation accumulation power sequence and recording the angle interval corresponding to the maximum value;
and the speaker azimuth angle estimation module is used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimation value.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202110664316.0A 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array Active CN113470682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664316.0A CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array


Publications (2)

Publication Number Publication Date
CN113470682A true CN113470682A (en) 2021-10-01
CN113470682B CN113470682B (en) 2023-11-24

Family

ID=77869967



Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006194700A (en) * 2005-01-12 2006-07-27 Hiroshima Industrial Promotion Organization Sound source direction estimation system, sound source direction estimation method and sound source direction estimation program
JP2007006253A (en) * 2005-06-24 2007-01-11 Sony Corp Signal processor, microphone system, and method and program for detecting speaker direction
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
US20110305345A1 (en) * 2009-02-03 2011-12-15 University Of Ottawa Method and system for a multi-microphone noise reduction
CN104142492A (en) * 2014-07-29 2014-11-12 佛山科学技术学院 SRP-PHAT multi-source spatial positioning method
CN106199607A (en) * 2016-06-29 2016-12-07 北京捷通华声科技股份有限公司 The Sounnd source direction localization method of a kind of microphone array and device
CN109188362A (en) * 2018-09-03 2019-01-11 中国科学院声学研究所 A kind of microphone array auditory localization signal processing method
CN110488223A (en) * 2019-07-05 2019-11-22 东北电力大学 A kind of sound localization method
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONGHO HAN: "Tracking of a moving object by following the sound source", 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM) *
ZHANG Kun: "Research on sound source localization algorithms for embedded microphone array systems", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN113470682B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
US10979805B2 (en) Microphone array auto-directive adaptive wideband beamforming using orientation information from MEMS sensors
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US7254241B2 (en) System and process for robust sound source localization
JP2008079256A (en) Acoustic signal processing apparatus, acoustic signal processing method, and program
CN110534126B (en) Sound source positioning and voice enhancement method and system based on fixed beam forming
CN103308889A (en) Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment
CN108172231A (en) A kind of dereverberation method and system based on Kalman filtering
CN111624553A (en) Sound source positioning method and system, electronic equipment and storage medium
CN111866665B (en) Microphone array beam forming method and device
CN109859769B (en) Mask estimation method and device
Zhang et al. Robust DOA Estimation Based on Convolutional Neural Network and Time-Frequency Masking.
Nesta et al. A flexible spatial blind source extraction framework for robust speech recognition in noisy environments
KR20110021419A (en) Apparatus and method for reducing noise in the complex spectrum
Beit-On et al. Speaker localization using the direct-path dominance test for arbitrary arrays
CN113870893A (en) Multi-channel double-speaker separation method and system
CN116312602B (en) Voice signal beam forming method based on interference noise space spectrum matrix
CN110890099A (en) Sound signal processing method, device and storage medium
CN113470682B (en) Method, device and storage medium for estimating speaker azimuth by microphone array
JP6182169B2 (en) Sound collecting apparatus, method and program thereof
CN115932733A (en) Sound source positioning and voice enhancing method and device
CN117037836B (en) Real-time sound source separation method and device based on signal covariance matrix reconstruction
Tiantian et al. Underwater Acoustic Sensing with Rational Orthogonal Wavelet Pulse and Auditory Frequency Cepstral Coefficient-Based Feature Extraction
Matsuo et al. Estimating DOA of multiple speech signals by improved histogram mapping method
CN111933182B (en) Sound source tracking method, device, equipment and storage medium
CN113808606B (en) Voice signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant