CN113470682A - Method, device and storage medium for estimating speaker orientation by microphone array - Google Patents

Method, device and storage medium for estimating speaker orientation by microphone array

Info

Publication number: CN113470682A
Application number: CN202110664316.0A
Authority: CN (China)
Original language: Chinese (zh)
Other versions: CN113470682B (granted)
Prior art keywords: array, speaker, sequence, array element, spectrum
Legal status: Granted; Active
Inventors: 马登永, 蔡野锋, 沐永生, 叶超
Assignee (current and original): Zhongke Shangsheng Suzhou Electronics Co ltd
Application filed by Zhongke Shangsheng Suzhou Electronics Co ltd
Priority: CN202110664316.0A
Publications: CN113470682A (application), CN113470682B (grant)

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 — Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 — Noise filtering
    • G10L21/0216 — Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 — Processing in the frequency domain
    • G10L21/0272 — Voice signal separating
    • G10L2021/02082 — Noise filtering, the noise being echo or reverberation of the speech
    • G10L2021/02161 — Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 — Microphone arrays; Beamforming

Abstract

The invention discloses a method, a device and a storage medium for estimating the orientation of a speaker with a microphone array. The method comprises the following steps: S1, performing noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence; S2, calculating frequency-domain orientation estimation factors between array elements of the microphone array from the noise magnitude spectrum sequence and the frequency-domain sequence of the collected signals; S3, calculating the cross-correlation accumulated power sequence of the array's frequency-domain and spatial-domain orientation estimate from the frequency-domain orientation estimation factors; and S4, searching for the maximum of the cross-correlation accumulated power sequence and taking the corresponding angle as the estimate of the speaker's direction. The invention can accurately estimate the speaker's orientation in a noisy environment.

Description

Method, device and storage medium for estimating speaker orientation by microphone array
Technical Field
The invention belongs to the field of speech processing, in particular to microphone-array sound pickup, and relates to a method, a device and a storage medium for estimating the direction of a speaker with a microphone array.
Background
With the continuing breakthroughs and development of artificial intelligence, great progress has been made in the speech field: branches such as echo cancellation, speech enhancement, speech recognition, voiceprint recognition, semantic analysis and semantic understanding are all developing rapidly, and all face a common problem, namely how to pick up high-quality speech signals. At present, microphone arrays are used to pick up, with high quality, the speech signal from the specific direction in which a speaker is located, so estimating the speaker's direction has become a key part of high-quality array pickup. This orientation estimate is affected by both ambient reverberation and ambient noise, and its precision usually suffers large errors.
At present, the most classical method for estimating the speaker's orientation with a microphone array is SRP-PHAT (Steered-Response Power with Phase Transform). SRP-PHAT suppresses reverberation well, but as soon as even slight ambient noise appears it tends to estimate the direction of the noise instead; the method cannot suppress ambient noise. For example, the microphone-array direction-of-arrival estimation based on SRP-PHAT proposed in Chinese patent application 201410366922.4 estimates the direction of a noise source rather than that of the speaker when a noise source is present, so the method is deficient in noisy environments.
Fig. 1 shows the array element positions of a three-element microphone array: three microphones spaced 120 degrees apart, uniformly distributed on a circle of radius 4 cm. As shown in figs. 2 and 3, speaker localization with this array was performed using the SRP-PHAT method in two scenarios. Scenario 1 (fig. 2): the speaker stands 3 meters from the array and speaks in turn from the 45-, 90- and 135-degree directions, with no noise source turned on. Scenario 2 (fig. 3): the speaker again speaks in turn from the 45-, 90- and 135-degree directions at 3 meters, while a noise source 3 meters away in the 200-degree direction plays white noise at a signal-to-noise ratio of 5 dB. As shown in fig. 4, in scenario 1, without a noise source, SRP-PHAT can estimate the speaker's orientation. As shown in fig. 5, in the presence of the noise source, SRP-PHAT cannot estimate the speaker's orientation and mostly estimates the orientation of the noise source.
Comparing fig. 4 and fig. 5 shows that SRP-PHAT effectively estimates the speaker's orientation in a reverberant environment, but in the presence of a noise source it mostly estimates the orientation of the noise source instead.
In summary: (1) Existing speaker orientation estimation methods such as SRP (Steered-Response Power) may fail to estimate the speaker's position in reverberant scenes. To improve robustness to reverberation, many researchers adopted the classical SRP-PHAT (Steered-Response Power with Phase Transform) method for speaker localization, which achieves fairly good results in reverberant environments. (2) Existing methods often cannot accurately estimate the speaker's orientation in scenes containing a noise source; in particular, when a noise source is active, the orientation estimated by SRP-PHAT turns out to be that of the noise source, so the speaker cannot be located. SRP-PHAT thus fails at speaker localization in environments where noise sources are present.
To address the inapplicability of SRP-PHAT in noisy environments, the present disclosure provides a method and a device for speaker orientation estimation with a microphone array, so that the array can accurately estimate the speaker's orientation in both reverberant and noisy environments, achieving the goal of high-quality pickup of the speaker's voice.
Disclosure of Invention
The invention aims to provide a method, a device and a storage medium for estimating the orientation of a speaker with a microphone array, capable of accurately estimating the speaker's orientation in a noisy environment.
According to a first aspect of the present invention, there is provided a method of estimating a speaker's orientation using a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence;
s2, calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise magnitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
s3, calculating a cross-correlation accumulation power sequence of the frequency domain and the space domain orientation estimation of the microphone array according to the frequency domain orientation estimation factor;
and S4, searching the maximum value of the cross-correlation accumulation power sequence, and taking the angle corresponding to the maximum value as the direction estimation value of the speaker.
In some embodiments, the microphone array contains P array elements, each frame of signal collected by the array has frame length L, and each length-L frame undergoes an M-point fast Fourier transform (FFT), where M is the number of FFT points;
a frame signal x_{p,t} collected by array element p is defined as:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
where p is the array element index, p = 1, 2, …, P; t is the frame index, t = 1, 2, …, T, with T the total number of frames; and x_{p,t,l} is the time-domain sample of the p-th element, t-th frame, at the l-th sampling instant;
after the M-point FFT of the frame, the signal spectrum sequence X_{p,t} is obtained as:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
where X_{p,t,m} is the signal spectrum sample of the p-th element, t-th frame, at the m-th frequency sampling point.
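The framing and transform step described above can be sketched as follows; the channel count, frame length, and FFT size are illustrative values, and white noise stands in for a real multichannel recording:

```python
import numpy as np

fs = 16000   # sampling rate (Hz)
L = 512      # frame length L
M = 512      # M-point FFT per frame
P = 3        # number of array elements

# placeholder multichannel recording: P channels, one second of white noise
rng = np.random.default_rng(1)
x = rng.standard_normal((P, fs))

T = x.shape[1] // L                       # total number of frames T
frames = x[:, :T * L].reshape(P, T, L)    # x_{p,t}: one L-sample frame per element
X = np.fft.fft(frames, n=M, axis=-1)      # X_{p,t}: M-point spectrum per frame
# X.shape == (P, T, M), i.e. (3, 31, 512)
```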
In some embodiments, step S2 specifically includes:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal collected by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
where P_{p,t,m} = X_{p,t,m} × X*_{p,t,m} is the signal power spectrum value of the p-th element, t-th frame, m-th frequency sampling point, X*_{p,t,m} denotes the conjugate of the signal spectrum sample, and m = 1, 2, …, M;
according to the noise magnitude spectrum sequence N_{p,t} of array element p, calculating the noise signal power spectrum Q_{p,t} of element p as:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
where Q_{p,t,m} = N_{p,t,m} × N_{p,t,m} is the noise power spectrum value and N_{p,t,m} the noise magnitude spectrum value of the p-th element, t-th frame, m-th frequency sampling point;
then calculating the frequency-domain orientation estimation factor PT_{p,q,t,m} between array elements p and q at the m-th frequency point, a function of the signal power spectrum values P_{p,t,m} and P_{q,t,m}, the noise power spectrum values Q_{p,t,m} and Q_{q,t,m}, and a weighting factor δ; its defining equation appears only as an image in this text extraction.
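The patent's exact formula for PT_{p,q,t,m} is only an equation image in this extraction, so the sketch below uses an explicitly assumed PHAT-style form in which the whitening denominator is inflated by the δ-weighted noise power, which matches the stated intent of reducing the noise spectrum's contribution; treat the `PT` expression as an assumption, not the patented formula:

```python
import numpy as np

M, delta = 512, 0.25     # FFT size; weighting factor delta (0.25 per the text)
rng = np.random.default_rng(2)
Xp = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # spectrum, element p
Xq = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # spectrum, element q
Np = 0.1 * np.abs(rng.standard_normal(M))                   # noise magnitude spectra
Nq = 0.1 * np.abs(rng.standard_normal(M))

Pp = (Xp * np.conj(Xp)).real    # P_{p,t,m} = X_{p,t,m} * conj(X_{p,t,m})
Pq = (Xq * np.conj(Xq)).real
Qp = Np * Np                    # Q_{p,t,m} = N_{p,t,m}^2
Qq = Nq * Nq

# ASSUMED form: PHAT-style whitening whose denominator grows with the
# delta-weighted noise power, so noise-dominated bins contribute less
PT = 1.0 / (np.sqrt(Pp * Pq) + delta * np.sqrt(Qp * Qq) + 1e-12)
```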
In some embodiments, step S3 specifically includes:
the cross-correlation spectrum G_{p,q,t} between array elements p and q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
where G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, and X*_{q,t,m} is the conjugate signal spectrum sample of the q-th element, t-th frame, m-th frequency sampling point;
from the frequency-domain orientation estimation factor PT_{p,q,t,m}, the cross-correlation power PTG_{p,q,t} is calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
where PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the speaker is at azimuth θ, the transmission delay from the speaker's position to array element p is defined as:
τ_p = r·cos(θ − θ_p)/c
where r is the distance from element p to the geometric center of the microphone array, θ_p is the azimuth of element p, and c is the speed of sound;
from the transmission delay τ_p, the phase H_{p,m} of element p at frequency point m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where f = fs/M is the frequency-domain sampling interval and fs is the system sampling rate;
the cross-correlation phase H_{p,q,m} between elements p and q at frequency point m is:
H_{p,q,m} = H*_{p,m} × H_{q,m}
selecting a frequency-point range from M1 to M2 out of the M frequency points, the cross-correlation accumulated power PTGF_{p,q,t} of the frequency-domain orientation estimate is obtained as:
PTGF_{p,q,t} = Σ_{m=M1..M2} PTG_{p,q,t,m} × H_{p,q,m}
and the cross-correlation accumulated power PTGFS_t of the joint frequency-domain and spatial-domain orientation estimate is:
PTGFS_t = Σ_{p=1..P−1} Σ_{q=p+1..P} PTGF_{p,q,t}
In some embodiments, in step S3, the speaker orientation is estimated using multiple time-domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the joint time-, frequency- and spatial-domain estimate is:
PTGFST(θ) = Σ_{t=1..T} PTGFS_t
In some embodiments, in step S4, the speaker azimuth θ is stepped over a full circle of 0 to 360 degrees at angular interval Δθ, computing the cross-correlation accumulated power PTGFST(θ) at each azimuth. With N_θ total angle search points, the accumulated power sequence is:
PTGFST = [PTGFST(Δθ) PTGFST(2Δθ) … PTGFST(N_θΔθ)]^T
The maximum of the sequence is located, and the angle corresponding to the maximum is the estimate of the speaker's direction:
θ̂ = argmax_θ PTGFST(θ)
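The S4 angle search can be sketched independently of the power computation; the objective function below is a stand-in with a known single peak, not the patent's PTGFST, so only the grid-and-argmax mechanics are illustrated:

```python
import numpy as np

delta_theta = 1.0                    # angular interval (degrees)
N_theta = int(360 / delta_theta)     # total number of angle search points

def ptgfst(theta_deg):
    # stand-in objective peaking at 135 degrees; in the real method this
    # would be the cross-correlation accumulated power for azimuth theta
    return -abs(theta_deg - 135.0)

# evaluate the sequence [PTGFST(dtheta), PTGFST(2*dtheta), ..., PTGFST(N*dtheta)]
powers = np.array([ptgfst(k * delta_theta) for k in range(1, N_theta + 1)])
theta_hat = (np.argmax(powers) + 1) * delta_theta   # angle at the maximum
# theta_hat == 135.0
```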
in some embodiments, the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
According to a second aspect of the present invention, there is provided an apparatus for estimating a speaker's orientation using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to perform the method described above on the speaker signals collected by the microphone array.
In some embodiments, the processor comprises:
the array noise spectrum estimation module is used for receiving the signals collected by the microphone array and calculating a noise magnitude spectrum sequence;
the array frequency domain orientation estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
the cross-correlation accumulation power generation module is used for receiving the frequency domain orientation estimation factors output by the array frequency domain orientation estimation factor generation module and calculating a cross-correlation accumulation power sequence by combining a time domain, a frequency domain and a space domain;
the maximum value searching module is used for receiving the cross-correlation accumulation power sequence output by the cross-correlation accumulation power generating module, searching the maximum value of the cross-correlation accumulation power sequence and recording the angle interval corresponding to the maximum value; and
the speaker azimuth angle estimation module, used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimate.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described above.
Compared with the prior art, the scheme of the invention has the following advantages:
The method for estimating the speaker's orientation with a microphone array separates the noise spectrum from the speech spectrum, suppresses the weight of the noise spectrum in the orientation estimation factor, and reduces the noise spectrum's contribution to it, giving a marked anti-noise effect. The invention also has good anti-reverberation performance and can effectively estimate the speaker's orientation in a reverberant environment. The algorithm occupies few resources and is suitable for engineering implementation on low-compute platforms.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a diagram of array element positions of a ternary microphone array.
Fig. 2 is a plan view of the speaker and noise orientation.
FIG. 3 is a three-dimensional view of the speaker and noise orientation.
FIG. 4 shows the speaker orientation curve estimated by the SRP-PHAT method without noise sources.
FIG. 5 shows the speaker orientation curve estimated by the SRP-PHAT method in the presence of a noise source.
Fig. 6 shows a flow diagram of a method according to an embodiment of the invention.
Fig. 7 shows a block diagram of an apparatus according to an embodiment of the invention.
FIG. 8 shows the speaker orientation curve estimated under scenario 1 by the method of an embodiment of the present invention.
FIG. 9 shows the speaker orientation curve estimated in scenario 2 by the method of an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings so that the advantages and features of the invention may be more readily understood by those skilled in the art. It should be noted that the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
According to an embodiment of the present invention, fig. 6 shows the signal-processing diagram of a method for estimating a speaker's orientation with a microphone array; the flow of the method is described in detail below with reference to fig. 6.
(1) Noise spectrum estimation.
Assume the microphone array has P array elements, each frame of signal collected by the array has length L, and an M-point FFT is applied to each length-L frame. A frame signal collected by array element p is defined as:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
where t denotes the t-th frame, t = 1, 2, …, T.
After the M-point FFT of the frame, the frequency-domain sequence is obtained as:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
Using the classical MCRA (Minima-Controlled Recursive Averaging) noise spectrum estimation method, the estimated M-point noise magnitude spectrum sequence is:
N_{p,t} = [N_{p,t,1} N_{p,t,2} … N_{p,t,M}]^T
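Full MCRA involves minimum tracking and speech-presence probabilities; the sketch below is a heavily simplified recursive-averaging stand-in (the smoothing constant and speech threshold are assumed values), shown only to convey the idea of updating the noise magnitude spectrum on noise-dominated frames while skipping speech-dominated ones:

```python
import numpy as np

alpha = 0.95                 # recursive smoothing constant (assumed value)
rng = np.random.default_rng(3)
M, T = 512, 50
# magnitude spectra |X_{p,t}| for one element: a low noise floor plus a
# burst of much louder "speech" frames
mag = 0.1 + 0.02 * rng.standard_normal((T, M)) ** 2
mag[20:25] += 2.0

N_est = mag[0].copy()
for t in range(1, T):
    # minima-controlled idea, heavily simplified: bins far above the current
    # noise estimate are treated as speech-dominated and excluded from the update
    speech = mag[t] > 3.0 * N_est
    N_est = np.where(speech, N_est, alpha * N_est + (1.0 - alpha) * mag[t])
```

The resulting `N_est` tracks the noise floor (around 0.1 here) rather than the speech burst.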
(2) Computing the array frequency-domain orientation estimation factors.
From the signal spectrum sequence X_{p,t} of array element p, the power spectrum of the collected signal of element p can be calculated as:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
where P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}.
From the estimated noise magnitude spectrum sequence N_{p,t} of element p, the noise signal power spectrum of element p can be calculated as:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
where Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}.
Following a cross-correlation approach, the frequency-domain orientation estimation factor at the m-th frequency point between array elements p and q is then computed from these signal and noise power spectra and a weighting factor δ, typically chosen as 0.25; its defining equation appears only as an image in this text extraction.
(3) Computing the array time-domain and frequency-domain accumulated power.
The cross-correlation spectrum between array elements p and q is:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
where G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}.
From the frequency-domain orientation estimation factor PT_{p,q,t,m}, the cross-correlation power of the factor can be calculated as:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
where PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m}.
Assume the speaker is at azimuth θ and the microphone array is circular with radius r from each element to the center of the circle. The transmission delay from the speaker's position to array element p is then defined as:
τ_p = r·cos(θ − θ_p)/c
where θ_p is the azimuth of element p and c is the speed of sound.
From the transmission delay τ_p, the phase of element p at frequency point m is defined as:
H_{p,m} = exp(−j2πmfτ_p)
where f = fs/M is the frequency-domain sampling interval, fs is the system sampling rate, and M is the number of FFT points.
If the microphone array has another shape, the phase of element p at frequency point m is computed the same way, with τ_p derived from that geometry.
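As a concrete sketch of the delay and phase computation for the circular geometry (the element azimuths, radius, sample rate, and the 90-degree candidate azimuth are illustrative values):

```python
import numpy as np

c, fs, M, r = 343.0, 16000, 512, 0.04        # speed of sound, sample rate, FFT size, radius
elem_az = np.deg2rad([0.0, 120.0, 240.0])    # theta_p: element azimuths (ternary array)
f = fs / M                                    # frequency-domain sampling interval f = fs/M

def tau(theta_deg):
    # transmission delay from a speaker at azimuth theta to each element,
    # relative to the geometric center of the circular array
    th = np.deg2rad(theta_deg)
    return r * np.cos(th - elem_az) / c

m = np.arange(M // 2)                                  # positive-frequency bins
H = np.exp(-2j * np.pi * np.outer(tau(90.0), m * f))   # H_{p,m} = exp(-j*2*pi*m*f*tau_p)
```

Each row of `H` is the unit-magnitude phase ramp of one element for the candidate azimuth.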
The cross-correlation phase at frequency point m between array elements p and q is:
H_{p,q,m} = H*_{p,m} × H_{q,m}
Selecting a frequency-point range from M1 to M2 out of the M frequency points for the cross-correlation accumulated power of the frequency-domain orientation estimate gives:
PTGF_{p,q,t} = Σ_{m=M1..M2} PTG_{p,q,t,m} × H_{p,q,m}
With P array elements, the cross-correlation accumulated power of the joint frequency-domain and spatial-domain orientation estimate is:
PTGFS_t = Σ_{p=1..P−1} Σ_{q=p+1..P} PTGF_{p,q,t}
if the speaker's orientation is estimated using the signals of multiple frames of data in the time domain, then the accumulated power of cross-correlation for the time, frequency, and spatial orientation estimates needs to be calculated as follows:
Figure BDA0003116245400000086
(4) and searching the maximum value of the power accumulated by the cross correlation of the time domain, the frequency domain and the space domain azimuth estimation.
Within a circle of 0 to 360 degrees, at angular intervals ΔθSequentially changing the azimuth theta of the speaker to calculate the cross-correlation accumulated power PTGFST (theta) corresponding to each azimuth, and assuming that the total point number of angle search is NθThe cross-correlation accumulated power sequence is as follows:
PTGFST=[PTGFST(Δθ)PTGFST(2Δθ)…PTGFST(NθΔθ)]T
searching the maximum value of the cross-correlation accumulation power sequence, and finding out the angle corresponding to the maximum value, namely the direction estimation value of the speaker, as follows:
Figure BDA0003116245400000091
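As an end-to-end sanity check of the steering and accumulation machinery, the toy below simulates a far-field speaker as pure inter-element delays and recovers the azimuth by grid search; a plain PHAT weight stands in for the patent's noise-weighted PT factor, and the geometry and the 72-degree source are arbitrary choices:

```python
import numpy as np

c, fs, M, r = 343.0, 16000, 512, 0.04
elem_az = np.deg2rad([0.0, 120.0, 240.0])    # ternary circular array
P = len(elem_az)
f = fs / M
m = np.arange(M // 2)

def steer(theta):
    # per-element phase terms exp(-j*2*pi*m*f*tau_p) for azimuth theta (radians)
    tau = r * np.cos(theta - elem_az) / c
    return np.exp(-2j * np.pi * np.outer(tau, m * f))

# simulate a speaker at 72 degrees as pure delays of one flat random spectrum
rng = np.random.default_rng(0)
S = rng.standard_normal(M // 2) + 1j * rng.standard_normal(M // 2)
X = S * steer(np.deg2rad(72.0))              # (P, M/2) element spectra

def srp(X, grid):
    powers = []
    for theta in grid:
        H = steer(theta)
        acc = 0.0
        for p in range(P):
            for q in range(p + 1, P):
                G = X[p] * np.conj(X[q])     # cross-correlation spectrum
                W = 1.0 / (np.abs(G) + 1e-12)   # plain PHAT weight (stand-in for PT)
                A = H[p] * np.conj(H[q])     # candidate cross-correlation phase
                # aligned phases add coherently at the true azimuth
                acc += np.sum(np.real(W * G * np.conj(A)))
        powers.append(acc)
    return np.array(powers)

grid = np.deg2rad(np.arange(0.0, 360.0, 1.0))
theta_hat = np.rad2deg(grid[np.argmax(srp(X, grid))])   # close to 72 degrees
```

Because the simulation matches the steering model exactly, the accumulated power peaks at the true azimuth; with real reverberant and noisy signals the peak broadens, which is where the noise-weighted factor is meant to help.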
according to another embodiment of the present invention, an apparatus for estimating a speaker using a microphone array is provided. Referring to fig. 7, the apparatus comprises a microphone array 10, a memory, a processor 1 and a computer program stored on the memory and executable on the processor. The processor 1 executes the program to perform the method as described above when processing the speaker signals of the microphone array acquisition 10. Specifically, the processor 1 includes:
an array noise spectrum estimation module 11, which receives the collected signals from the microphone array 10, calculates an array noise spectrum by using classical spectral subtraction, and sends the array noise spectrum data to an array frequency domain orientation estimation factor generation module 12;
an array frequency domain orientation estimation factor generation module 12, configured to receive the noise magnitude spectrum sequence, and calculate a frequency domain orientation estimation factor between each array element of the microphone array 10 according to the noise magnitude spectrum sequence and the frequency domain sequence of the signal acquired by the microphone array 10;
a cross-correlation accumulation power generation module 13, configured to receive the frequency domain orientation estimation factor sent by the array frequency domain orientation estimation factor generation module 12, calculate a cross-correlation accumulation power sequence by combining the time domain, the frequency domain, and the space domain, and send the power sequence to an accumulation power sequence maximum value search module 14;
a maximum value searching module 14, configured to receive the cross-correlation accumulated power sequence output by the cross-correlation accumulated power generation module 13, search for the maximum value of the sequence, record the angle interval index corresponding to the maximum, and send it to the speaker azimuth angle estimation module 15;
and the speaker azimuth angle estimation module 15 is used for receiving the angle interval output by the maximum value search module 14 and calculating a speaker azimuth angle estimation value.
The microphone array 10 collects speaker signals and sends the speaker signals to the array noise spectrum estimation module 11.
According to yet another embodiment of the invention, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the method described above.
In scenario 1 shown in fig. 2 and scenario 2 shown in fig. 3, respectively, the speaker orientation is estimated by using the method of the above embodiment, and the resulting speaker orientation curves are shown in fig. 8 and fig. 9, respectively. As can be seen from fig. 8 and 9, the method of the embodiment can estimate the orientation of the speaker more accurately in the presence of noise sources.
The above embodiments are merely illustrative of the technical ideas and features of the present invention, and are preferred embodiments, which are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the scope of the present invention. All equivalent changes or modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims (10)

1. A method for estimating a speaker's orientation with a microphone array, comprising the steps of:
s1, carrying out noise spectrum estimation on the signals collected by the microphone array to obtain a noise magnitude spectrum sequence;
s2, calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise magnitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
s3, calculating a cross-correlation accumulation power sequence of the frequency domain and the space domain orientation estimation of the microphone array according to the frequency domain orientation estimation factor;
and S4, searching the maximum value of the cross-correlation accumulation power sequence, and taking the angle corresponding to the maximum value as the direction estimation value of the speaker.
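Taken together, steps S1 to S4 amount to a noise-weighted steered-response-power search over candidate azimuths, in the family of SRP-PHAT methods. The sketch below is only an illustration under stated assumptions: the minimum-statistics noise estimate, the form of the noise weighting, and the sign conventions are guesses, since the corresponding equations appear only as images in this text; it is not the patented implementation.

```python
import numpy as np

def estimate_speaker_azimuth(frames, mic_angles, r, fs, c=343.0, n_angles=360):
    """Sketch of S1-S4 for a circular array: noise-weighted SRP-PHAT.

    frames: (P, T, L) time-domain frames per array element.
    mic_angles: (P,) element azimuths theta_p in radians; r: array radius (m).
    """
    P, T, L = frames.shape
    X = np.fft.rfft(frames, axis=-1)                      # spectra X_{p,t,m}
    # S1 (assumed): crude minimum-statistics noise magnitude estimate per element/bin
    noise_mag = np.min(np.abs(X), axis=1, keepdims=True)
    # S2 (assumed weight form): attenuate bins dominated by the noise estimate
    power = np.abs(X) ** 2
    weight = np.maximum(power - noise_mag ** 2, 0.0) / (power + 1e-12)
    freqs = np.fft.rfftfreq(L, d=1.0 / fs)
    thetas = np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)
    srp = np.zeros(n_angles)
    for p in range(P):
        for q in range(p + 1, P):
            # S3: weighted, PHAT-normalised cross-spectrum, accumulated over frames
            G = X[p] * np.conj(X[q]) * weight[p] * weight[q]
            G = G / (np.abs(G) + 1e-12)
            Gsum = G.sum(axis=0)
            # candidate inter-element delay for each azimuth (far-field model)
            tau = (r / c) * (np.cos(thetas[:, None] - mic_angles[p])
                             - np.cos(thetas[:, None] - mic_angles[q]))
            srp += np.real(np.exp(2j * np.pi * freqs * tau) @ np.conj(Gsum))
    # S4: the angle with maximum accumulated power is the azimuth estimate
    return np.degrees(thetas[np.argmax(srp)])
```

For a uniform circular array, `mic_angles` holds the element azimuths and `r` the array radius, matching the geometry assumed in the later claims.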
2. The method of claim 1, wherein the microphone array comprises P array elements, the microphone array collects frame signals with frame length L, and each frame signal of length L is subjected to an M-point fast Fourier transform operation, wherein M represents the number of points of the fast Fourier transform;
a frame signal x_{p,t} collected by array element p is defined as follows:
x_{p,t} = [x_{p,t,1} x_{p,t,2} … x_{p,t,L}]^T
wherein p represents the array element index, p = 1, 2, …, P; t represents the frame index, t = 1, 2, …, T, where T is the total number of frames; x_{p,t,l} represents the time domain sample value of the p-th array element, t-th frame and l-th sampling moment;
after the frame signal is subjected to the M-point fast Fourier transform operation, the signal spectrum sequence X_{p,t} is obtained as follows:
X_{p,t} = [X_{p,t,1} X_{p,t,2} … X_{p,t,M}]^T
wherein X_{p,t,m} represents the signal spectrum sample value of the p-th array element, t-th frame and m-th frequency sampling point.
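Claim 2's indexing maps directly onto array axes. A minimal numpy sketch of the framing and FFT step (the dimensions here are illustrative, not taken from the patent):

```python
import numpy as np

# x[p, t, l]: time-domain sample of element p, frame t, sample l (claim 2's x_{p,t,l});
# X[p, t, m]: its M-point FFT (claim 2's X_{p,t,m}). Dimensions are illustrative.
P, T, L = 4, 10, 256
M = L                                   # assumption: FFT length equals frame length
rng = np.random.default_rng(1)
x = rng.standard_normal((P, T, L))      # one frame per (element, frame index)
X = np.fft.fft(x, n=M, axis=-1)         # spectrum sequences X_{p,t}
```

Each `X[p, t]` is then one spectrum sequence X_{p,t} of length M, as in the claim.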
3. The method according to claim 2, wherein step S2 specifically includes:
according to the signal spectrum sequence X_{p,t}, calculating the power spectrum P_{p,t} of the signal collected by array element p:
P_{p,t} = [P_{p,t,1} P_{p,t,2} … P_{p,t,M}]^T
wherein P_{p,t,m} = X_{p,t,m} × X*_{p,t,m}; P_{p,t,m} represents the signal power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, X*_{p,t,m} represents the conjugate value of the signal spectrum of the p-th array element, t-th frame and m-th frequency sampling point, and m = 1, 2, …, M;
according to the noise magnitude spectrum sequence N_{p,t} of array element p, calculating the noise signal power spectrum Q_{p,t} of array element p as follows:
Q_{p,t} = [Q_{p,t,1} Q_{p,t,2} … Q_{p,t,M}]^T
wherein Q_{p,t,m} = N_{p,t,m} × N_{p,t,m}; Q_{p,t,m} represents the noise power spectrum value of the p-th array element, t-th frame and m-th frequency sampling point, and N_{p,t,m} represents the noise magnitude spectrum value of the p-th array element, t-th frame and m-th frequency sampling point;
calculating the frequency domain orientation estimation factor PT_{p,q,t,m} between array element p and array element q at the m-th frequency point according to the following formula:
[equation image FDA0003116245390000021: definition of PT_{p,q,t,m} in terms of P_{p,t,m}, P_{q,t,m}, Q_{p,t,m}, Q_{q,t,m} and δ; not recoverable from the text]
wherein P_{q,t,m} represents the signal power spectrum value of the q-th array element, t-th frame and m-th frequency sampling point, Q_{p,t,m} and Q_{q,t,m} represent the corresponding noise power spectrum values, and δ is a weighting factor.
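The exact expression for PT_{p,q,t,m} exists only as an equation image in this text and cannot be recovered. The function below is therefore only a hypothetical stand-in built from the same inputs (signal powers P, noise powers Q, weighting factor δ), shown to make the role of the factor concrete; it is not the patented formula.

```python
import numpy as np

def orientation_weight(Pp, Pq, Qp, Qq, delta=1e-3):
    """Hypothetical stand-in for PT_{p,q,t,m}: emphasize frequency bins whose
    signal power exceeds the noise power estimate on both elements. delta plays
    the role of the claim's weighting factor. Not the patented formula."""
    wp = np.maximum(Pp - Qp, 0.0) / (Pp + delta)   # per-element SNR-style weight
    wq = np.maximum(Pq - Qq, 0.0) / (Pq + delta)
    return wp * wq                                 # joint weight for the (p, q) pair
```

A weight of this shape goes to zero on noise-dominated bins and toward one on clean bins, which is the qualitative behavior the claim's factor needs.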
4. The method according to claim 2 or 3, wherein step S3 specifically comprises:
assuming the cross-correlation spectrum G_{p,q,t} between array elements p and q is as follows:
G_{p,q,t} = [G_{p,q,t,1} G_{p,q,t,2} … G_{p,q,t,M}]^T
wherein G_{p,q,t,m} = X_{p,t,m} × X*_{q,t,m}, and X*_{q,t,m} represents the signal spectrum conjugate value of the q-th array element, t-th frame and m-th frequency sampling point;
according to the frequency domain orientation estimation factor PT_{p,q,t,m}, calculating the cross-correlation power PTG_{p,q,t} of the frequency domain orientation estimation factor as follows:
PTG_{p,q,t} = [PTG_{p,q,t,1} PTG_{p,q,t,2} … PTG_{p,q,t,M}]^T
wherein PTG_{p,q,t,m} = PT_{p,q,t,m} × G_{p,q,t,m};
assuming the speaker azimuth angle is θ, the transmission delay τ_p from the speaker position to array element p is defined as:
τ_p = r cos(θ − θ_p) / c
wherein r is the distance from array element p to the geometric center of the microphone array, θ_p is the azimuth angle of array element p, and c is the sound velocity;
according to the transmission delay τ_p from the speaker position to array element p, defining the phase H_{p,m} of array element p at frequency point m as follows:
H_{p,m} = exp(−j2πmfτ_p)
wherein the frequency domain sampling interval f = fs / M, and fs is the system sampling rate;
the cross-correlation phase H_{p,q,m} between array elements p and q at frequency point m is as follows:
H_{p,q,m} = H_{p,m} × H*_{q,m}
assuming a frequency point sequence from M1 to M2 is selected from the M frequency points, the cross-correlation accumulated power PTGF_{p,q,t} for frequency domain orientation estimation is calculated as:
PTGF_{p,q,t} = Σ_{m=M1}^{M2} PTG_{p,q,t,m} × H_{p,q,m}
the cross-correlation accumulated power PTGFS_t for frequency domain and spatial domain orientation estimation is as follows:
PTGFS_t = Σ_{p=1}^{P−1} Σ_{q=p+1}^{P} PTGF_{p,q,t}
5. The method according to claim 4, wherein in step S3, the speaker's orientation is estimated using the signals of multiple time domain data frames, and the cross-correlation accumulated power PTGFST(θ) of the time domain, frequency domain and spatial domain orientation estimates is calculated as follows:
PTGFST(θ) = Σ_{t=1}^{T} PTGFS_t
6. The method according to claim 5, wherein in step S4, within the full circle of 0 to 360 degrees, the speaker azimuth θ is varied sequentially at an angular interval of Δ_θ degrees, and the cross-correlation accumulated power PTGFST(θ) corresponding to each azimuth is calculated; assuming the total number of angle search points is N_θ, the cross-correlation accumulated power sequence PTGFST is as follows:
PTGFST = [PTGFST(Δ_θ) PTGFST(2Δ_θ) … PTGFST(N_θΔ_θ)]^T
searching the maximum value of the cross-correlation accumulated power sequence, the angle corresponding to the maximum value is the speaker orientation estimate:
θ̂ = Δ_θ × argmax_{n=1,…,N_θ} PTGFST(nΔ_θ)
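Claim 6's grid search is a one-liner in practice. In the sketch below, `ptgfst` stands for any callable implementing claim 5's accumulated power PTGFST(θ); it is a placeholder, not the patented module.

```python
import numpy as np

def search_azimuth(ptgfst, delta_theta=1.0):
    """Scan theta over the full circle in steps of delta_theta degrees and
    return the angle whose accumulated power is maximal (claim 6's S4)."""
    angles = np.arange(delta_theta, 360.0 + delta_theta, delta_theta)  # delta..N*delta
    powers = np.array([ptgfst(a) for a in angles])
    return angles[np.argmax(powers)]
```

The angular step Δ_θ trades resolution against the cost of evaluating PTGFST once per grid point.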
7. the method of claim 5, wherein the microphone array is a circular array, and r is the distance from the array element p to the center of the microphone array.
8. An apparatus for estimating a speaker's orientation using a microphone array, comprising a microphone array, a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the method of any one of claims 1 to 7 on the speaker signals collected by the microphone array.
9. The apparatus of claim 8, wherein the processor comprises:
the array noise spectrum estimation module is used for receiving the signals collected by the microphone array and calculating a noise magnitude spectrum sequence;
the array frequency domain orientation estimation factor generation module is used for receiving the noise amplitude spectrum sequence and calculating frequency domain orientation estimation factors among array elements of the microphone array according to the noise amplitude spectrum sequence and the frequency domain sequence of the signals collected by the microphone array;
the cross-correlation accumulation power generation module is used for receiving the frequency domain orientation estimation factors output by the array frequency domain orientation estimation factor generation module and calculating a cross-correlation accumulation power sequence by combining a time domain, a frequency domain and a space domain;
the maximum value searching module is used for receiving the cross-correlation accumulation power sequence output by the cross-correlation accumulation power generating module, searching the maximum value of the cross-correlation accumulation power sequence and recording the angle interval corresponding to the maximum value;
and the speaker azimuth angle estimation module is used for receiving the angle interval output by the maximum value search module and calculating the speaker azimuth angle estimation value.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN202110664316.0A 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array Active CN113470682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110664316.0A CN113470682B (en) 2021-06-16 2021-06-16 Method, device and storage medium for estimating speaker azimuth by microphone array


Publications (2)

Publication Number Publication Date
CN113470682A true CN113470682A (en) 2021-10-01
CN113470682B CN113470682B (en) 2023-11-24

Family

ID=77869967



Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006194700A (en) * 2005-01-12 2006-07-27 Hiroshima Industrial Promotion Organization Sound source direction estimation system, sound source direction estimation method and sound source direction estimation program
JP2007006253A (en) * 2005-06-24 2007-01-11 Sony Corp Signal processor, microphone system, and method and program for detecting speaker direction
WO2008041878A2 (en) * 2006-10-04 2008-04-10 Micronas Nit System and procedure of hands free speech communication using a microphone array
US20110305345A1 (en) * 2009-02-03 2011-12-15 University Of Ottawa Method and system for a multi-microphone noise reduction
CN104142492A (en) * 2014-07-29 2014-11-12 佛山科学技术学院 SRP-PHAT multi-source spatial positioning method
CN106199607A (en) * 2016-06-29 2016-12-07 北京捷通华声科技股份有限公司 The Sounnd source direction localization method of a kind of microphone array and device
CN109188362A (en) * 2018-09-03 2019-01-11 中国科学院声学研究所 A kind of microphone array auditory localization signal processing method
CN110488223A (en) * 2019-07-05 2019-11-22 东北电力大学 A kind of sound localization method
CN112216295A (en) * 2019-06-25 2021-01-12 大众问问(北京)信息科技有限公司 Sound source positioning method, device and equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONGHO HAN: "Tracking of a moving object by following the sound source", 2012 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM) *
ZHANG Kun: "Research on sound source localization algorithms for embedded microphone array systems", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN113470682B (en) 2023-11-24

Similar Documents

Publication Publication Date Title
US10979805B2 (en) Microphone array auto-directive adaptive wideband beamforming using orientation information from MEMS sensors
US9837099B1 (en) Method and system for beam selection in microphone array beamformers
US7254241B2 (en) System and process for robust sound source localization
JP2008079256A (en) Acoustic signal processing apparatus, acoustic signal processing method, and program
CN110534126B (en) Sound source positioning and voice enhancement method and system based on fixed beam forming
CN103308889A (en) Passive sound source two-dimensional DOA (direction of arrival) estimation method under complex environment
CN108172231A (en) A kind of dereverberation method and system based on Kalman filtering
CN111624553A (en) Sound source positioning method and system, electronic equipment and storage medium
CN111866665B (en) Microphone array beam forming method and device
CN109859769B (en) Mask estimation method and device
Zhang et al. Robust DOA Estimation Based on Convolutional Neural Network and Time-Frequency Masking.
Nesta et al. A flexible spatial blind source extraction framework for robust speech recognition in noisy environments
KR20110021419A (en) Apparatus and method for reducing noise in the complex spectrum
Beit-On et al. Speaker localization using the direct-path dominance test for arbitrary arrays
CN113870893A (en) Multi-channel double-speaker separation method and system
CN116312602B (en) Voice signal beam forming method based on interference noise space spectrum matrix
CN110890099A (en) Sound signal processing method, device and storage medium
CN113470682B (en) Method, device and storage medium for estimating speaker azimuth by microphone array
JP6182169B2 (en) Sound collecting apparatus, method and program thereof
CN115932733A (en) Sound source positioning and voice enhancing method and device
CN117037836B (en) Real-time sound source separation method and device based on signal covariance matrix reconstruction
Tiantian et al. Underwater Acoustic Sensing with Rational Orthogonal Wavelet Pulse and Auditory Frequency Cepstral Coefficient-Based Feature Extraction
Matsuo et al. Estimating DOA of multiple speech signals by improved histogram mapping method
CN111933182B (en) Sound source tracking method, device, equipment and storage medium
CN113808606B (en) Voice signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant