CN112799017B - Sound source positioning method, sound source positioning device, storage medium and electronic equipment - Google Patents


Info

Publication number: CN112799017B (application CN202110369681.9A)
Authority: CN (China)
Prior art keywords: matrix, target, angle, preset angle, determining
Legal status: Active
Application number: CN202110369681.9A
Other languages: Chinese (zh)
Other versions: CN112799017A
Inventors: 王克彦, 俞鸣园
Current assignee (also original assignee): Zhejiang Huachuang Video Signal Technology Co Ltd
Application filed by Zhejiang Huachuang Video Signal Technology Co Ltd
Priority to CN202110369681.9A
Publication of CN112799017A
Application granted
Publication of CN112799017B

Classifications

    • G — PHYSICS
    • G01 — MEASURING; TESTING
    • G01S — RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 — Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 — Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 — Position of source determined by a plurality of spaced direction-finders

Abstract

The invention discloses a sound source positioning method, a sound source positioning device, a storage medium and electronic equipment. The positioning method comprises the following steps: determining target frequency domain information of a target audio signal received by a microphone array at the k-th time frame; determining, according to the target frequency domain information, an inverse matrix of a target covariance matrix of the target audio signal; for each preset angle in a plurality of preset angles, determining a spatial spectrum corresponding to the preset angle according to the inverse matrix of the target covariance matrix and a time delay matrix corresponding to the preset angle, wherein the inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the time delay matrix corresponding to the preset angle is both a self-conjugate matrix and a Toeplitz matrix; and determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle, and positioning a sound source corresponding to the target audio signal according to the target angle. The complexity of sound source localization is thus reduced.

Description

Sound source positioning method, sound source positioning device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a sound source positioning method and apparatus, a storage medium, and an electronic device.
Background
Direction of Arrival (DOA) refers to the direction of arrival of a spatial signal. At present, microphone-array-based DOA estimation is widely applied in fields such as audio and video conferencing, monitoring and recognition, multimedia systems, and smart speakers, and is an important subject in the field of human-computer interaction. In practical application scenarios, the DOA of a sound source calculated by the microphone array affects the subsequent speech enhancement effect and the human-computer interaction experience, which places certain requirements on the accuracy of the calculated DOA. DOA estimation is affected by various factors, such as the wideband characteristics of speech, microphone performance, environmental noise, and room reverberation, all of which reduce positioning accuracy; the DOA technique of a microphone array is therefore required to have a certain robustness. Meanwhile, considering the computing resources and efficiency of different hardware platforms during actual deployment, the computational complexity of DOA estimation should be reduced as much as possible, along with the power consumption it occupies.
Disclosure of Invention
The present disclosure provides a sound source localization method, a sound source localization apparatus, a storage medium, and an electronic device, which can reduce the complexity of sound source localization and improve the efficiency of sound source localization.
In order to achieve the above object, in a first aspect, the present disclosure provides a sound source localization method, the method comprising:
determining target frequency domain information of a target audio signal received by the microphone array at the k-th time frame, wherein the microphone array is composed of a plurality of microphones arranged according to a preset spatial topology, and k is a positive integer;
determining an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information;
determining a spatial spectrum corresponding to a preset angle according to an inverse matrix of a target covariance matrix and a time delay matrix corresponding to the preset angle for each preset angle in a plurality of preset angles, wherein elements in the time delay matrix corresponding to the preset angle comprise frequency domain information of relative time delay between audio signals received by every two microphones in a microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the time delay matrix corresponding to the preset angle is a self-conjugate matrix and is a Toeplitz matrix;
and determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle, and positioning a sound source corresponding to the target audio signal according to the target angle.
Optionally, the determining target frequency domain information of the target audio signal received by the microphone array at the k-th time frame comprises:
determining, for each microphone in the microphone array, time domain information of the audio signal received by the microphone at the k-th time frame;
determining the target frequency domain information according to the time domain information of the audio signal received by each microphone at the k-th time frame.
Optionally, k is a positive integer greater than or equal to 2; the determining an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information includes:
determining an initial covariance matrix of the target audio signal according to the target frequency domain information;
reconstructing the initial covariance matrix by adopting a diagonal loading technology to determine the target covariance matrix;
determining, according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, a forgetting factor corresponding to the k-th time frame, and an inverse matrix of a covariance matrix of the audio signal received by the microphone array at the (k-1)-th time frame, the inverse matrix of the target covariance matrix, wherein the forgetting factor corresponding to the k-th time frame characterizes the variation of the position, relative to the microphone array, of the sound source corresponding to the target audio signal with respect to the position of the sound source corresponding to the audio signal received at the (k-1)-th time frame.
Optionally, the determining the inverse matrix of the target covariance matrix according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse matrix of the covariance matrix of the audio signal received at the (k-1)-th time frame comprises:
determining the inverse matrix of the target covariance matrix by the following formula:
$$\hat{R}^{-1}_{\mathrm{ASMW}}(k,f)=\frac{1}{\beta_k}\left[\hat{R}^{-1}(k-1,f)-\frac{\hat{R}^{-1}(k-1,f)\,X(k,f)\,X^{H}(k,f)\,\hat{R}^{-1}(k-1,f)}{\beta_k+X^{H}(k,f)\,\hat{R}^{-1}(k-1,f)\,X(k,f)}\right]$$

where $\hat{R}(k,f)$ represents the target covariance matrix, $\hat{R}^{-1}(k,f)$ represents the inverse matrix of the target covariance matrix, ASMW denotes the adaptive Sherman-Morrison-Woodbury algorithm based on a forgetting factor, $f$ represents the frequency point, $\beta_k$ represents the forgetting factor corresponding to the k-th time frame, $\hat{R}^{-1}(k-1,f)$ represents the inverse matrix of the covariance matrix of the audio signal received by the microphone array at the (k-1)-th time frame, $X(k,f)$ represents the target frequency domain information, and $H$ represents the conjugate transpose of a matrix.
Optionally, k is a positive integer greater than or equal to 3; the forgetting factor corresponding to the k-th time frame is determined by the following formula:

$$\beta_k=\beta_{k-1}-\mu\left|\alpha_{k-1}-\alpha_{k-2}\right|$$

where $\beta_k$ represents the forgetting factor corresponding to the k-th time frame, $\beta_{k-1}$ represents the forgetting factor corresponding to the (k-1)-th time frame, $\mu$ is the preset iteration step size, and $\left|\alpha_{k-1}-\alpha_{k-2}\right|$ is the absolute value of the difference between the first angle and the second angle, wherein the first angle is the relative angle between the microphone array and the sound source of the audio signal received at the (k-1)-th time frame, and the second angle is the relative angle between the microphone array and the sound source of the audio signal received at the (k-2)-th time frame.
Optionally, the time delay matrix corresponding to the preset angle is obtained according to an inner product of a steering vector corresponding to the preset angle and a conjugate transpose of the steering vector.
Optionally, the determining, according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to the preset angle, a spatial spectrum corresponding to the preset angle includes:
obtaining a first vector according to the inverse matrix of the target covariance matrix, wherein elements in the first vector comprise the sum of elements on a main diagonal of the inverse matrix of the target covariance matrix and the sum of elements on a diagonal parallel to the main diagonal in an upper triangular matrix of the inverse matrix of the target covariance matrix;
extracting a first row of a time delay matrix corresponding to the preset angle to obtain a second vector;
and determining the space spectrum corresponding to the preset angle according to the inner product of the first vector and the second vector.
Optionally, the determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle includes:
determining a spatial spectrum maximum value in a spatial spectrum matrix corresponding to each preset angle;
and taking the preset angle corresponding to the maximum value of the spatial spectrum as the target angle.
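The structured spatial-spectrum computation and spectrum peak search described in the optional steps above can be sketched as follows. This is an illustrative Python sketch, not the patent's verbatim implementation: the array geometry, frequency, preset-angle grid, and the random stand-in for the inverse covariance matrix are all invented for the demonstration. It exploits the fact that, for a uniform linear array, the delay matrix is Hermitian and Toeplitz, so the quadratic form collapses to one length-M inner product between the diagonal sums of the inverse covariance (the "first vector") and a vector built from the delay matrix's first row (the "second vector").

```python
import numpy as np

M, f, c, d = 6, 1000.0, 343.0, 0.05      # mics, Hz, speed of sound (m/s), spacing (m)
rng = np.random.default_rng(0)
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
Q_inv = np.linalg.inv(B @ B.conj().T + M * np.eye(M))  # Hermitian stand-in for R^-1(k,f)

def steering(theta):
    # Far-field steering vector of a uniform linear array, cf. expression (2).
    return np.exp(-2j * np.pi * f * d * np.arange(M) * np.cos(theta) / c)

# "First vector": sums over the main diagonal and each upper diagonal of R^-1.
s = np.array([np.trace(Q_inv, offset=k) for k in range(M)])

def spectrum(theta):
    a = steering(theta)
    A = np.outer(a, a.conj())            # delay matrix: Hermitian and Toeplitz
    t = np.conj(A[0, :])                 # "second vector" built from A's first row
    # a^H R^-1 a = tr(R^-1 A) collapses to a single length-M inner product:
    denom = (s[0] * t[0]).real + 2.0 * np.sum((s[1:] * t[1:]).real)
    return 1.0 / denom

angles = np.deg2rad(np.arange(0.0, 181.0, 5.0))                   # preset angles
target = angles[int(np.argmax([spectrum(th) for th in angles]))]  # peak search
```

The structured result agrees with the direct quadratic form $a^{H}R^{-1}a$ while touching only $2M-1$ complex products per angle instead of $O(M^2)$.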
In a second aspect, the present disclosure provides a sound source localization apparatus, the apparatus comprising:
a frequency domain information determination module, configured to determine target frequency domain information of a target audio signal received by the microphone array at the k-th time frame, wherein the microphone array is composed of a plurality of microphones arranged according to a preset spatial topology, and k is a positive integer;
an inverse matrix determining module, configured to determine an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information;
a spatial spectrum determining module, configured to determine, for each preset angle in a plurality of preset angles, a spatial spectrum corresponding to the preset angle according to an inverse matrix of the target covariance matrix and a time delay matrix corresponding to the preset angle, where elements in the time delay matrix corresponding to the preset angle comprise frequency domain information of the relative time delay between audio signals received by every two microphones in the microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the time delay matrix corresponding to the preset angle is both a self-conjugate matrix and a Toeplitz matrix;
and the positioning module is used for determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle and positioning a sound source corresponding to the target audio signal according to the target angle.
Optionally, the frequency domain information determining module includes:
a first determining submodule, configured to determine, for each microphone in the microphone array, time domain information of the audio signal received by the microphone at the k-th time frame;
a second determining submodule, configured to determine the target frequency domain information according to the time domain information of the audio signal received by each microphone at the k-th time frame.
Optionally, k is a positive integer greater than or equal to 2; the inverse matrix determination module includes:
a third determining submodule, configured to determine an initial covariance matrix of the target audio signal according to the target frequency domain information;
the reconstruction submodule is used for reconstructing the initial covariance matrix by adopting a diagonal loading technology and determining the target covariance matrix;
a fourth determining submodule, configured to determine the inverse matrix of the target covariance matrix according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse matrix of the covariance matrix of the audio signal received by the microphone array at the (k-1)-th time frame, wherein the forgetting factor corresponding to the k-th time frame characterizes the variation of the position, relative to the microphone array, of the sound source corresponding to the target audio signal with respect to the position of the sound source corresponding to the audio signal received at the (k-1)-th time frame.
Optionally, the fourth determining submodule is configured to determine an inverse matrix of the target covariance matrix by:
$$\hat{R}^{-1}_{\mathrm{ASMW}}(k,f)=\frac{1}{\beta_k}\left[\hat{R}^{-1}(k-1,f)-\frac{\hat{R}^{-1}(k-1,f)\,X(k,f)\,X^{H}(k,f)\,\hat{R}^{-1}(k-1,f)}{\beta_k+X^{H}(k,f)\,\hat{R}^{-1}(k-1,f)\,X(k,f)}\right]$$

where $\hat{R}(k,f)$ represents the target covariance matrix, $\hat{R}^{-1}(k,f)$ represents the inverse matrix of the target covariance matrix, ASMW denotes the adaptive Sherman-Morrison-Woodbury algorithm based on a forgetting factor, $f$ represents the frequency point, $\beta_k$ represents the forgetting factor corresponding to the k-th time frame, $\hat{R}^{-1}(k-1,f)$ represents the inverse matrix of the covariance matrix of the audio signal received by the microphone array at the (k-1)-th time frame, $X(k,f)$ represents the target frequency domain information, and $H$ represents the conjugate transpose of a matrix.
Optionally, k is a positive integer greater than or equal to 3; the forgetting factor corresponding to the k-th time frame is determined by the following formula:

$$\beta_k=\beta_{k-1}-\mu\left|\alpha_{k-1}-\alpha_{k-2}\right|$$

where $\beta_k$ represents the forgetting factor corresponding to the k-th time frame, $\beta_{k-1}$ represents the forgetting factor corresponding to the (k-1)-th time frame, $\mu$ is the preset iteration step size, and $\left|\alpha_{k-1}-\alpha_{k-2}\right|$ is the absolute value of the difference between the first angle and the second angle, wherein the first angle is the relative angle between the microphone array and the sound source of the audio signal received at the (k-1)-th time frame, and the second angle is the relative angle between the microphone array and the sound source of the audio signal received at the (k-2)-th time frame.
Optionally, the time delay matrix corresponding to the preset angle is obtained according to an inner product of a steering vector corresponding to the preset angle and a conjugate transpose of the steering vector.
Optionally, the spatial spectrum determination module comprises:
the vector determination submodule is used for obtaining a first vector according to the inverse matrix of the target covariance matrix, wherein elements in the first vector comprise the sum of elements on a main diagonal of the inverse matrix of the target covariance matrix and the sum of elements on a diagonal parallel to the main diagonal in an upper triangular matrix of the inverse matrix of the target covariance matrix;
the extraction submodule is used for extracting a first row of the time delay matrix corresponding to the preset angle to obtain a second vector;
and the spatial spectrum determining submodule is used for determining the spatial spectrum corresponding to the preset angle according to the inner product of the first vector and the second vector.
Optionally, the positioning module includes:
the fifth determining submodule is used for determining the maximum value of the spatial spectrum in the spatial spectrum matrix corresponding to each preset angle;
and the sixth determining submodule is used for taking the preset angle corresponding to the maximum value of the spatial spectrum as the target angle.
In a third aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a memory having a computer program stored thereon; a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
According to the technical scheme, the spatial spectrum corresponding to each preset angle is determined according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to that angle, and the target angle is determined from the plurality of preset angles according to the spatial spectra, so that the sound source corresponding to the target audio signal is positioned according to the target angle. The inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing the steering vector corresponding to the preset angle, and that time delay matrix is both a self-conjugate matrix and a Toeplitz matrix. According to the properties of self-conjugate and Toeplitz matrices, the computation is further simplified and the complexity of calculating the spatial spectrum corresponding to each preset angle is reduced, so that the complexity of sound source localization, and the power consumption of the microphone array for sound source localization, can be reduced. In addition, the efficiency of sound source localization is improved, and in interactive scenarios such as voice conferences and video conferences, the speech enhancement effect and the human-computer interaction experience of the microphone array can be improved.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
fig. 1 is a flow chart illustrating a sound source localization method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a method of determining an inverse of a target covariance matrix of a target audio signal from target frequency domain information according to an example embodiment.
FIG. 3 is a block diagram illustrating a sound source localization arrangement according to an exemplary embodiment.
FIG. 4 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
In related sound source localization algorithms, the traditional beamforming localization algorithm suffers from a wide main lobe and insufficient positioning precision. The time delay estimation localization algorithm has a low computation load, which favors fast implementation, but its correlation peak is not pronounced at low signal-to-noise ratios, its robustness is poor, it is limited by factors such as the signal sampling rate and the microphone array type, and its positioning precision is low. High-resolution spectrum estimation techniques such as the MUSIC algorithm achieve super-resolution estimation by constructing orthogonal signal and noise subspaces through eigenvalue decomposition, but the MUSIC algorithm requires certain prior knowledge of the signal and noise, and its computational complexity is high. The MVDR (Minimum Variance Distortionless Response) algorithm obtains a DOA estimate of the sound source by constructing the covariance matrix of the array frequency domain signal, computing its inverse, multiplying by the steering vector to obtain a spatial power spectrum, and searching the spatial spectrum for a peak; it maintains positioning accuracy and has good robustness, but it too has high complexity, with the computation concentrated in covariance matrix inversion, power spectrum calculation, and spectrum peak search. Therefore, in the related art, sound source localization incurs high computational complexity and low efficiency.
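The conventional MVDR pipeline summarized above (covariance matrix, matrix inversion, spatial power spectrum, spectrum peak search) can be sketched as follows. This is an illustrative demonstration only: the array geometry, source angle, snapshot count, and noise level are assumptions invented for the demo, not parameters from the disclosure.

```python
import numpy as np

M, d, c, f = 8, 0.04, 343.0, 2000.0      # mics, spacing (m), speed of sound, Hz
true_deg = 60.0                          # simulated source direction
rng = np.random.default_rng(1)

def steering(deg):
    tau = d * np.arange(M) * np.cos(np.deg2rad(deg)) / c
    return np.exp(-2j * np.pi * f * tau)

# Simulate N frequency-domain snapshots of one source plus sensor noise.
N = 200
S = rng.standard_normal(N) + 1j * rng.standard_normal(N)
X = np.outer(steering(true_deg), S)
X += 0.1 * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))

R = (X @ X.conj().T) / N                 # covariance of the array signal
R_inv = np.linalg.inv(R)                 # the costly inversion step

grid = np.arange(0.0, 180.5, 1.0)        # candidate angles (degrees)
P = [1.0 / (steering(g).conj() @ R_inv @ steering(g)).real for g in grid]
est = grid[int(np.argmax(P))]            # spectrum peak -> DOA estimate
```

The per-angle quadratic form is exactly the $O(M^2)$ cost that the structured computation of the present disclosure aims to reduce.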
In view of this, the present disclosure provides a sound source positioning method, a sound source positioning device, a storage medium, and an electronic device, which can reduce the complexity of sound source positioning and improve the efficiency of sound source positioning.
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart illustrating a sound source localization method according to an exemplary embodiment, which may be applied to an electronic device having a processing capability, and as shown in fig. 1, may include S101 to S104.
In S101, target frequency domain information of a target audio signal received by the microphone array at the k-th time frame is determined.
The microphone array is formed by a plurality of microphones arranged according to a preset spatial topology; for example, the preset spatial topology may be such that the distance between any two adjacent microphones is equal. k is a positive integer. Each microphone of the microphone array can receive an audio signal, and the target audio signal received by the microphone array at the k-th time frame may refer to the combined signal of the audio signals received by all the microphones at the k-th time frame.
Optionally, an exemplary implementation of step S101 may be: determining, for each microphone in the microphone array, time domain information of the audio signal received by the microphone at the k-th time frame; and determining the target frequency domain information according to the time domain information of the audio signal received by each microphone at the k-th time frame.
The audio signal received by each microphone may be preprocessed, for example, by performing VAD (Voice Activity Detection), framing, and windowing, and the time domain information of the audio signal is Fourier transformed to obtain its frequency domain information. The frequency domain information of the audio signal received by the m-th microphone in the microphone array at the k-th time frame may be determined by the following expression (1):
$$X_m(k,f)=h_m(k,f)\,S(k,f)+W_m(k,f)\qquad(1)$$
where $f$ represents the frequency point, $X_m(k,f)$ represents the frequency domain information of the audio signal received by the m-th microphone at the k-th time frame, $S(k,f)$ represents the frequency domain information of the original sound source signal at the k-th time frame, $W_m(k,f)$ represents the noise at the k-th time frame, and $h_m(k,f)$ represents the propagation path function from the sound source to the m-th microphone at the k-th time frame, which can be determined by the following expression (2):
$$h_m(k,f)=e^{-j\,2\pi f\,d_m\cos\alpha/c}\qquad(2)$$

where $j$ represents the imaginary unit, $d_m$ represents the distance of the m-th microphone from the microphone array origin, $c$ represents the speed of sound, and $\alpha$ represents the target angle.
The target frequency domain information of the target audio signal received by the microphone array at the k-th time frame may be determined by the following expression (3):
$$X(k,f)=\left[X_1(k,f),\,X_2(k,f),\,\dots,\,X_M(k,f)\right]^{T}\qquad(3)$$

where $X(k,f)$ represents the target frequency domain information of the target audio signal received by the microphone array at the k-th time frame, $M$ represents the number of microphones in the microphone array, and $T$ represents matrix transposition.
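Expressions (1) through (3) amount to per-microphone framing, windowing, and FFT, followed by stacking the per-microphone spectra into the array vector $X(k,f)$. A minimal sketch of this step, assuming an illustrative frame length, window, and synthetic time-domain frames (none of these values are specified by the patent):

```python
import numpy as np

fs, L, M = 16000, 512, 4                 # sample rate, frame length, mic count (assumed)
rng = np.random.default_rng(2)
frames = rng.standard_normal((M, L))     # k-th time-domain frame for each microphone

window = np.hanning(L)                   # windowing before the Fourier transform
spectra = np.fft.rfft(frames * window, axis=1)   # X_m(k, f) for all frequency bins

def array_vector(f_bin):
    # Expression (3): X(k, f) = [X_1(k,f), ..., X_M(k,f)]^T at one frequency point.
    return spectra[:, f_bin]

X = array_vector(100)
```

Each column of `spectra` is one frequency point's array snapshot, which is exactly the quantity the covariance estimation in S102 consumes.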
In S102, an inverse matrix of a target covariance matrix of the target audio signal is determined according to the target frequency domain information.
Fig. 2 is a flowchart illustrating a method of determining an inverse matrix of a target covariance matrix of a target audio signal according to target frequency domain information according to an exemplary embodiment, and as shown in fig. 2, S102 may include S1021 to S1023.
In S1021, an initial covariance matrix of the target audio signal is determined according to the target frequency domain information.
For example, the initial covariance matrix of the target audio signal may be determined by the following equation (4):

$$R(k,f)=E\!\left[X(k,f)X^{H}(k,f)\right]\approx\frac{1}{N}\sum_{n=0}^{N-1}X(k-n,f)\,X^{H}(k-n,f)\qquad(4)$$

where $R(k,f)$ represents the initial covariance matrix of the target audio signal, $N$ represents the number of preset time frames within a preset duration, $n$ represents the n-th time frame, $X(k-n,f)$ represents the frequency domain information of the audio signal received by the microphone array at the (k-n)-th time frame, $E$ represents expectation, and $H$ represents the conjugate transpose of a matrix.
In S1022, the initial covariance matrix is reconstructed by using a diagonal loading technique, and a target covariance matrix is determined.
Reconstructing the initial covariance matrix $R(k,f)$ using the diagonal loading technique, the resulting target covariance matrix can be expressed by the following expression (5):

$$\hat{R}(k,f)=R(k,f)+\lambda I\qquad(5)$$

where $\hat{R}(k,f)$ represents the target covariance matrix, $\lambda$ represents a preset diagonal loading factor, and $I$ represents an identity matrix.
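Steps S1021 and S1022, i.e. equations (4) and (5), can be sketched as averaged outer products of the most recent array snapshots followed by diagonal loading. The snapshot count N and the loading factor λ below are illustrative choices, not values fixed by the patent:

```python
import numpy as np

M, N, lam = 4, 32, 1e-2                  # mics, averaged frames, loading factor (assumed)
rng = np.random.default_rng(3)
snapshots = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))

# Equation (4): sample average of X(k-n,f) X^H(k-n,f) over the last N frames.
R = sum(np.outer(x, x.conj()) for x in snapshots) / N

# Equation (5): diagonal loading guarantees a well-conditioned, invertible matrix.
R_loaded = R + lam * np.eye(M)
```

Diagonal loading keeps the subsequent inversion stable even when N is small relative to M or the snapshots are nearly coherent.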
In S1023, the inverse matrix of the target covariance matrix is determined according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse matrix of the covariance matrix of the audio signal received by the microphone array at the (k-1)-th time frame. k is a positive integer greater than or equal to 2.
In the present disclosure, considering that in practical situations the position of a sound source relative to the microphone array is not fixed and its relative orientation may change, an adaptive forgetting factor is introduced on the basis of the SMW (Sherman-Morrison-Woodbury) algorithm, and a generalized ASMW (Adaptive Sherman-Morrison-Woodbury) algorithm is proposed for calculating the inverse covariance matrix; the ASMW algorithm is an adaptive, forgetting-factor-based Sherman-Morrison-Woodbury algorithm, and the specific calculation of the SMW algorithm can be found in the related art. The forgetting factor corresponding to the k-th time frame characterizes the variation of the position, relative to the microphone array, of the sound source corresponding to the target audio signal with respect to the position of the sound source corresponding to the audio signal received at the (k-1)-th time frame.
Illustratively, the inverse of the target covariance matrix is determined by the following equation (6):

$$R_{ASMW}^{-1}(k,f) = \frac{1}{\lambda_k}\left[R^{-1}(k-1,f) - \frac{R^{-1}(k-1,f)\,X(k,f)\,X^H(k,f)\,R^{-1}(k-1,f)}{\lambda_k + X^H(k,f)\,R^{-1}(k-1,f)\,X(k,f)}\right] \qquad (6)$$

where R(k,f) represents the target covariance matrix, R_{ASMW}^{-1}(k,f) represents the inverse matrix of the target covariance matrix, ASMW denotes the Adaptive Sherman-Morrison-Woodbury algorithm, which is an adaptive Sherman-Morrison-Woodbury algorithm based on a forgetting factor, the inverse matrix of the target covariance matrix is a self-conjugate matrix, also called a Hermitian matrix, f represents the frequency point, λ_k represents the forgetting factor corresponding to the k-th time frame, R^{-1}(k-1,f) represents the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame, X(k,f) represents the target frequency domain information, and H represents the conjugate transpose of a matrix.
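Under the assumption that equation (6) corresponds to the recursion R(k,f) = λ_k·R(k−1,f) + X(k,f)X^H(k,f), the rank-one inverse update can be sketched in NumPy as follows (the function and variable names are illustrative, not from the patent):

```python
import numpy as np

def asmw_inverse_update(R_inv_prev, x, lam):
    """Sherman-Morrison rank-one update of an inverse covariance matrix
    with forgetting factor lam, following the shape of equation (6)."""
    Rx = R_inv_prev @ x                     # R^{-1}(k-1, f) X(k, f)
    denom = lam + np.vdot(x, Rx)            # lam + X^H R^{-1} X
    return (R_inv_prev - np.outer(Rx, Rx.conj()) / denom) / lam

# Sanity check against a direct inverse of lam*R + x x^H
rng = np.random.default_rng(0)
M = 4
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_prev = B @ B.conj().T + M * np.eye(M)     # Hermitian positive definite
x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
lam = 0.95

R_inv = asmw_inverse_update(np.linalg.inv(R_prev), x, lam)
R_direct = np.linalg.inv(lam * R_prev + np.outer(x, x.conj()))
assert np.allclose(R_inv, R_direct)
```

The update costs O(M²) per frame instead of the O(M³) of a fresh matrix inversion, which is the practical motivation for applying the SMW identity here.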
Illustratively, the forgetting factor corresponding to the k-th time frame may be determined by the following formula (7):

$$\lambda_k = \lambda_{k-1} - \mu\,\bigl|\theta_{k-1} - \theta_{k-2}\bigr| \qquad (7)$$

where λ_k represents the forgetting factor corresponding to the k-th time frame, λ_{k−1} represents the forgetting factor corresponding to the (k−1)-th time frame, μ is a preset iteration step size, and |θ_{k−1} − θ_{k−2}| is the absolute value of the difference between the first angle and the second angle. k is a positive integer greater than or equal to 3; the first angle θ_{k−1} is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−1)-th time frame, and the second angle θ_{k−2} is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−2)-th time frame. Moreover, in the case where the azimuth of the sound source with respect to the microphone array does not change, i.e., |θ_{k−1} − θ_{k−2}| = 0, then λ_k = λ_{k−1}.
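A minimal sketch of this adaptation rule. The sign of the angle-difference term (a moving source lowers λ so that stale frames are forgotten faster) and the clipping bounds are assumptions for illustration, not values stated in the text:

```python
def update_forgetting_factor(lam_prev, theta_prev, theta_prev2, mu=0.01,
                             lam_min=0.8, lam_max=1.0):
    """Adapt the forgetting factor from the change in estimated DOA:
    a larger angle change between consecutive frames lowers lam, so the
    recursion in equation (6) discounts old frames more aggressively."""
    lam = lam_prev - mu * abs(theta_prev - theta_prev2)
    return min(max(lam, lam_min), lam_max)  # keep lam in a stable range

# A stationary source leaves the forgetting factor unchanged
assert update_forgetting_factor(0.95, 60.0, 60.0) == 0.95
# A source that moved 10 degrees between frames is forgotten faster
assert update_forgetting_factor(0.95, 70.0, 60.0) < 0.95
```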
In this way, considering that in a practical situation the position of the sound source relative to the microphone array is not fixed and its relative orientation may change, an adaptive forgetting factor is introduced and the inverse matrix of the target covariance matrix is calculated by the ASMW algorithm, which improves the robustness of the inverse covariance matrix calculation when the sound source moves.
In S103, for each preset angle in the plurality of preset angles, a spatial spectrum corresponding to the preset angle is determined according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to the preset angle.
The elements in the delay matrix corresponding to the preset angle comprise frequency domain information of relative delay between audio signals received by every two microphones in the microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the delay matrix corresponding to the preset angle is a self-conjugate matrix and a Toeplitz matrix.
The steering vector corresponding to the preset angle θ may first be determined by the following expression (8):

$$a(f,\theta) = \left[\,1,\ e^{-j2\pi f d_{12}\cos\theta/c},\ \ldots,\ e^{-j2\pi f d_{1M}\cos\theta/c}\,\right]^T \qquad (8)$$

where a(f,θ) represents the steering vector corresponding to the preset angle θ, M represents the number of microphones included in the microphone array, d_{1m} represents the relative distance between the 1st microphone and the m-th microphone, and c is the speed of sound.
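A sketch of this construction for a uniform linear array with inter-microphone spacing d. The array geometry, the cos θ convention, and the speed-of-sound constant are assumptions matching expression (8) as reconstructed above:

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in m/s (assumed value)

def steering_vector(f, theta_deg, M, d):
    """Steering vector of an M-microphone uniform linear array:
    element m is exp(-j*2*pi*f * (m*d) * cos(theta) / c), with the
    first microphone taken as the zero-delay reference."""
    m = np.arange(M)
    tau = m * d * np.cos(np.radians(theta_deg)) / C_SOUND
    return np.exp(-1j * 2 * np.pi * f * tau)

a = steering_vector(f=1000.0, theta_deg=60.0, M=6, d=0.05)
assert a.shape == (6,)
assert np.isclose(a[0], 1.0)        # reference microphone
assert np.allclose(np.abs(a), 1.0)  # every element has unit modulus
```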
The method for reconstructing the steering vector corresponding to the preset angle to obtain the time delay matrix corresponding to the preset angle may be that the time delay matrix corresponding to the preset angle is obtained according to an inner product of the steering vector corresponding to the preset angle and a conjugate transpose of the steering vector.
Illustratively, the spatial spectrum corresponding to the preset angle θ is calculated using the following expression (9):

$$P_{ASMW\text{-}MVDR}(f,\theta) = \frac{1}{dot\left(R_{ASMW}^{-1}(k,f),\ A(f,\theta)\right)} \qquad (9)$$

where P_{ASMW-MVDR}(f,θ) represents the spatial spectrum corresponding to the preset angle θ, dot represents the inner product operation, and A(f,θ) represents the time delay matrix corresponding to the preset angle θ, which can be obtained by reconstructing the steering vector a(f,θ) corresponding to the preset angle θ. A(f,θ) can be determined by the following expression (10):

$$A(f,\theta) = \begin{bmatrix} 1 & e^{j2\pi f d_{12}\cos\theta/c} & \cdots & e^{j2\pi f d_{1M}\cos\theta/c} \\ e^{j2\pi f d_{21}\cos\theta/c} & 1 & \cdots & e^{j2\pi f d_{2M}\cos\theta/c} \\ \vdots & \vdots & \ddots & \vdots \\ e^{j2\pi f d_{M1}\cos\theta/c} & e^{j2\pi f d_{M2}\cos\theta/c} & \cdots & 1 \end{bmatrix} \qquad (10)$$

where A(f,θ) is an M×M self-conjugate matrix, d_{mn} represents the signed relative distance between the m-th microphone and the n-th microphone (d_{mn} = −d_{nm}), and d_{1M} and d_{M1} both represent the relative distance between the 1st microphone and the M-th microphone. For a microphone array arranged according to the preset spatial topology, the distance between any two adjacent microphones is equal, that is, the elements on each diagonal parallel to the main diagonal are equal, so A(f,θ) is also an M×M Toeplitz matrix; the delay matrix is therefore both a self-conjugate matrix and a Toeplitz matrix. Taking the element e^{j2\pi f d_{1m}\cos\theta/c} of A(f,θ) as an example, this element characterizes the frequency domain information of the relative time delay between the audio signals received by the 1st microphone and the m-th microphone in the microphone array if the relative angle between the sound source corresponding to the target audio signal and the microphone array is the preset angle θ.
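A sketch of the reconstruction A(f,θ) = a(f,θ)·a^H(f,θ) for a uniform linear array, verifying the two structural properties the method relies on (function and parameter names are illustrative):

```python
import numpy as np

def delay_matrix(f, theta_deg, M, d, c=343.0):
    """Delay matrix reconstructed as the outer product of the steering
    vector with its own conjugate transpose, per expression (10)."""
    m = np.arange(M)
    a = np.exp(-1j * 2 * np.pi * f * m * d * np.cos(np.radians(theta_deg)) / c)
    return np.outer(a, a.conj())

A = delay_matrix(f=1000.0, theta_deg=40.0, M=5, d=0.05)
# Self-conjugate (Hermitian): A equals its own conjugate transpose
assert np.allclose(A, A.conj().T)
# Toeplitz: every diagonal parallel to the main diagonal is constant,
# because element (m, n) depends only on the index difference m - n
for k in range(-4, 5):
    diag = np.diagonal(A, offset=k)
    assert np.allclose(diag, diag[0])
```

Both properties follow from the equal spacing of the array: the (m, n) element depends only on m − n, which is exactly what the first-row shortcut in expressions (11) and (12) exploits.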
Determining the spatial spectrum corresponding to the preset angle according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to the preset angle may include: obtaining a first vector according to the inverse matrix of the target covariance matrix, wherein elements in the first vector comprise the sum of elements on a main diagonal of the inverse matrix of the target covariance matrix and the sum of elements on a diagonal parallel to the main diagonal in an upper triangular matrix of the inverse matrix of the target covariance matrix; extracting a first row of a time delay matrix corresponding to the preset angle to obtain a second vector; and determining the space spectrum corresponding to the preset angle according to the inner product of the first vector and the second vector.
In the inner product operation of the matrices R_{ASMW}^{-1}(k,f) and A(f,θ), the lower triangular operation result S1 and the upper triangular operation result S2 satisfy S1 = S2*, so if the inner product is calculated by multiplying and summing the corresponding points of each element, the lower triangular part and the upper triangular part produce repeated operations. The inner product of the matrices R_{ASMW}^{-1}(k,f) and A(f,θ) is therefore equivalent to the inner product of a first vector and a second vector, where the elements on the main diagonal of R_{ASMW}^{-1}(k,f) are summed, and at the same time the elements on each diagonal parallel to the main diagonal in the upper triangular matrix of R_{ASMW}^{-1}(k,f) are summed, to obtain the first vector R_1(k,f). The second vector A_1(f,θ) is obtained by extracting the first row of the matrix A(f,θ), and the spatial spectrum corresponding to the preset angle θ can be obtained according to the inner product of the first vector and the second vector, where A_1(f,θ) is as shown in the following expression (11):

$$A_1(f,\theta) = \left[\,1,\ e^{j2\pi f d_{12}\cos\theta/c},\ \ldots,\ e^{j2\pi f d_{1M}\cos\theta/c}\,\right] \qquad (11)$$
Equation (9) may be equivalent to the following expression (12):

$$P_{ASMW\text{-}MVDR}(f,\theta) = \frac{1}{dot\left(R_1(k,f),\ A_1(f,\theta)\right)} \qquad (12)$$
Thus, in calculating the spatial spectrum corresponding to the preset angle θ, only the inner product of the vector R_1(k,f) and the vector A_1(f,θ) needs to be calculated, which greatly reduces the computational complexity while still obtaining an accurate result, improving the calculation efficiency. Compared with the traditional calculation mode, the number of multiplication operations is equivalent to 1/(2M+1) of the traditional one, and the number of addition operations is equivalent to 25% of the traditional one.
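The equivalence of expressions (9) and (12) can be checked numerically. In the sketch below (illustrative names; the scalar reduction over the collapsed diagonals is written out explicitly), the main-diagonal sum of R^{-1} pairs with the leading 1 of A_1, each superdiagonal sum pairs with one first-row element, and the conjugate lower triangle is recovered by taking twice the real part:

```python
import numpy as np

def spectrum_full(R_inv, a):
    """Conventional MVDR spatial spectrum: 1 / (a^H R^{-1} a)."""
    return 1.0 / np.real(a.conj() @ R_inv @ a)

def spectrum_fast(R_inv, A1):
    """Spectrum from the first vector (diagonal sums of R^{-1}) and the
    second vector (first row A1 of the Toeplitz delay matrix)."""
    M = R_inv.shape[0]
    R1 = np.array([np.trace(R_inv, offset=m) for m in range(M)])  # first vector
    denom = np.real(R1[0]) + 2.0 * np.real(np.sum(R1[1:] * A1[1:].conj()))
    return 1.0 / denom

rng = np.random.default_rng(1)
M, f, d, c = 6, 1000.0, 0.05, 343.0
B = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R_inv = np.linalg.inv(B @ B.conj().T + M * np.eye(M))  # Hermitian inverse covariance
m = np.arange(M)
a = np.exp(-1j * 2 * np.pi * f * m * d * np.cos(np.radians(30.0)) / c)
A1 = np.outer(a, a.conj())[0]                          # first row of A(f, theta)

assert np.isclose(spectrum_full(R_inv, a), spectrum_fast(R_inv, A1))
```

The fast path needs only M complex multiplications per angle instead of a full matrix-vector product, which is where the complexity reduction claimed above comes from.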
In S104, a target angle is determined from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle, and a sound source corresponding to the target audio signal is positioned according to the target angle.
A plurality of preset angles may be preset. The embodiment of determining the target angle from the plurality of preset angles may be: determining a spatial spectrum maximum value in a spatial spectrum matrix corresponding to each preset angle; and taking the preset angle corresponding to the maximum value of the spatial spectrum as the target angle. For example, if the spatial spectrum corresponding to the preset angle of 20 degrees is the maximum spatial spectrum, 20 degrees may be used as the target angle, which is the calculated DOA result of the sound source.
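An end-to-end sketch of this peak-picking step on simulated data. The array geometry, signal model, and all parameter values are illustrative assumptions: snapshots from a source at 60 degrees are used to build a diagonally loaded covariance matrix, the MVDR spatial spectrum is evaluated on a grid of preset angles, and the angle with the maximum spectrum is taken as the target angle:

```python
import numpy as np

rng = np.random.default_rng(7)
M, d, c, f = 6, 0.05, 343.0, 2000.0
theta_true = 60.0

def steering(theta_deg):
    m = np.arange(M)
    return np.exp(-1j * 2 * np.pi * f * m * d * np.cos(np.radians(theta_deg)) / c)

# Simulated snapshots: a source at theta_true plus sensor noise
T = 200
s = rng.standard_normal(T) + 1j * rng.standard_normal(T)
noise = 0.1 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
X = np.outer(steering(theta_true), s) + noise

# Sample covariance with diagonal loading, then its inverse
R = X @ X.conj().T / T + 1e-3 * np.eye(M)
R_inv = np.linalg.inv(R)

# Scan the preset angles and take the spectrum maximum as the target angle
angles = np.arange(0.0, 181.0, 1.0)
P = np.array([1.0 / np.real(steering(t).conj() @ R_inv @ steering(t))
              for t in angles])
target_angle = angles[np.argmax(P)]
assert abs(target_angle - theta_true) <= 3.0
```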
According to the technical scheme, the spatial spectrum corresponding to the preset angle is determined according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to the preset angle, the target angle is determined from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle, and therefore the sound source corresponding to the target audio signal is positioned according to the target angle. The inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing a guide vector corresponding to the preset angle, the time delay matrix corresponding to the preset angle is a self-conjugate matrix and a Toeplitz matrix, and according to the properties of the self-conjugate matrix and the Toeplitz matrix, the calculation complexity is further simplified, the complexity of calculating the spatial spectrum corresponding to each preset angle is reduced, so that the complexity of sound source positioning can be reduced, and the power consumption of the microphone array for sound source positioning is reduced. In addition, the efficiency of sound source positioning is improved, and the voice enhancement effect and the man-machine interaction experience effect of the microphone array can be improved in interactive scenes such as voice conferences and video conferences.
Based on the same inventive concept, the present disclosure also provides a sound source localization apparatus, and fig. 3 is a block diagram of a sound source localization apparatus according to an exemplary embodiment, as shown in fig. 3, the apparatus 300 may include:
a frequency domain information determining module 301, configured to determine target frequency domain information of a target audio signal received by the microphone array in the k-th time frame, where the microphone array is composed of a plurality of microphones arranged according to a preset spatial topology, and k is a positive integer;
an inverse matrix determining module 302, configured to determine an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information;
a spatial spectrum determining module 303, configured to determine, for each preset angle in a plurality of preset angles, a spatial spectrum corresponding to the preset angle according to an inverse matrix of the target covariance matrix and a delay matrix corresponding to the preset angle, where an element in the delay matrix corresponding to the preset angle includes frequency domain information of relative delay between audio signals received by each two microphones in the microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the delay matrix corresponding to the preset angle is a self-conjugate matrix and is a toeplitz matrix;
a positioning module 304, configured to determine a target angle from the multiple preset angles according to the spatial spectrums corresponding to the preset angles, and position a sound source corresponding to the target audio signal according to the target angle.
Optionally, the frequency domain information determining module 301 includes:
a first determining sub-module, configured to determine, for each microphone in the microphone array, time domain information of the audio signal received by the microphone in the k-th time frame;
a second determining sub-module, configured to determine the target frequency domain information according to the time domain information of the audio signal received by each microphone in the k-th time frame.
Optionally, k is a positive integer greater than or equal to 2; the inverse matrix determination module 302 includes:
a third determining submodule, configured to determine an initial covariance matrix of the target audio signal according to the target frequency domain information;
the reconstruction submodule is used for reconstructing the initial covariance matrix by adopting a diagonal loading technology and determining the target covariance matrix;
a fourth determining submodule, configured to determine the inverse of the target covariance matrix according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame, where the forgetting factor corresponding to the k-th time frame characterizes the variation of the position, relative to the microphone array, of the sound source corresponding to the target audio signal with respect to the position of the sound source corresponding to the audio signal received in the (k−1)-th time frame.
Optionally, the fourth determining submodule is configured to determine the inverse matrix of the target covariance matrix by:

$$R_{ASMW}^{-1}(k,f) = \frac{1}{\lambda_k}\left[R^{-1}(k-1,f) - \frac{R^{-1}(k-1,f)\,X(k,f)\,X^H(k,f)\,R^{-1}(k-1,f)}{\lambda_k + X^H(k,f)\,R^{-1}(k-1,f)\,X(k,f)}\right]$$

where R(k,f) represents the target covariance matrix, R_{ASMW}^{-1}(k,f) represents the inverse matrix of the target covariance matrix, ASMW denotes the Adaptive Sherman-Morrison-Woodbury algorithm, which is an adaptive Sherman-Morrison-Woodbury algorithm based on a forgetting factor, f represents the frequency point, λ_k represents the forgetting factor corresponding to the k-th time frame, R^{-1}(k-1,f) represents the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame, X(k,f) represents the target frequency domain information, and H represents the conjugate transpose of a matrix.
Optionally, k is a positive integer greater than or equal to 3, and the forgetting factor corresponding to the k-th time frame is determined by the following formula:

$$\lambda_k = \lambda_{k-1} - \mu\,\bigl|\theta_{k-1} - \theta_{k-2}\bigr|$$

where λ_k represents the forgetting factor corresponding to the k-th time frame, λ_{k−1} represents the forgetting factor corresponding to the (k−1)-th time frame, μ is a preset iteration step size, and |θ_{k−1} − θ_{k−2}| is the absolute value of the difference between the first angle and the second angle, where the first angle is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−1)-th time frame, and the second angle is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−2)-th time frame.
Optionally, the time delay matrix corresponding to the preset angle is obtained according to an inner product of a steering vector corresponding to the preset angle and a conjugate transpose of the steering vector.
Optionally, the spatial spectrum determination module 303 includes:
the vector determination submodule is used for obtaining a first vector according to the inverse matrix of the target covariance matrix, wherein elements in the first vector comprise the sum of elements on a main diagonal of the inverse matrix of the target covariance matrix and the sum of elements on a diagonal parallel to the main diagonal in an upper triangular matrix of the inverse matrix of the target covariance matrix;
the extraction submodule is used for extracting a first row of the time delay matrix corresponding to the preset angle to obtain a second vector;
and the spatial spectrum determining submodule is used for determining the spatial spectrum corresponding to the preset angle according to the inner product of the first vector and the second vector.
Optionally, the positioning module 304 includes:
the fifth determining submodule is used for determining the maximum value of the spatial spectrum in the spatial spectrum matrix corresponding to each preset angle;
and the sixth determining submodule is used for taking the preset angle corresponding to the maximum value of the spatial spectrum as the target angle.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 4 is a block diagram illustrating an electronic device 700 according to an example embodiment. As shown in fig. 4, the electronic device 700 may include: a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
The processor 701 is configured to control the overall operation of the electronic device 700, so as to complete all or part of the steps in the sound source localization method. The memory 702 is used to store various types of data to support operation at the electronic device 700, such as instructions for any application or method operating on the electronic device 700 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and the like. The memory 702 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. The multimedia components 703 may include screen and audio components. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 702 or transmitted through the communication component 705. The audio assembly also includes at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, 5G, NB-IOT, eMTC, or a combination of one or more of them, which is not limited herein.
The corresponding communication component 705 may thus include: Wi-Fi module, Bluetooth module, NFC module, etc.
In an exemplary embodiment, the electronic Device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the sound source localization method described above.
In another exemplary embodiment, a computer readable storage medium comprising program instructions which, when executed by a processor, implement the steps of the sound source localization method described above is also provided. For example, the computer readable storage medium may be the above-mentioned memory 702 comprising program instructions executable by the processor 701 of the electronic device 700 to perform the above-mentioned sound source localization method.
The preferred embodiments of the present disclosure are described in detail with reference to the accompanying drawings, however, the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solution of the present disclosure within the technical idea of the present disclosure, and these simple modifications all belong to the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (11)

1. A sound source localization method, characterized in that the method comprises:
determining target frequency domain information of a target audio signal received by the microphone array in the k-th time frame, wherein the microphone array is composed of a plurality of microphones arranged according to a preset spatial topology, and k is a positive integer;
determining an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information;
determining a spatial spectrum corresponding to a preset angle according to an inverse matrix of a target covariance matrix and a time delay matrix corresponding to the preset angle for each preset angle in a plurality of preset angles, wherein elements in the time delay matrix corresponding to the preset angle comprise frequency domain information of relative time delay between audio signals received by every two microphones in a microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the time delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the time delay matrix corresponding to the preset angle is a self-conjugate matrix and is a Toeplitz matrix;
and determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle, and positioning a sound source corresponding to the target audio signal according to the target angle.
2. The method of claim 1, wherein the determining target frequency domain information of a target audio signal received by the microphone array in the k-th time frame comprises:
determining, for each microphone in the microphone array, time domain information of the audio signal received by the microphone in the k-th time frame;
determining the target frequency domain information according to the time domain information of the audio signal received by each microphone in the k-th time frame.
3. The method of claim 1, wherein k is a positive integer greater than or equal to 2, and the determining an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information includes:
determining an initial covariance matrix of the target audio signal according to the target frequency domain information;
reconstructing the initial covariance matrix by adopting a diagonal loading technology to determine the target covariance matrix;
determining the inverse of the target covariance matrix according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame, wherein the forgetting factor corresponding to the k-th time frame characterizes the variation of the position, relative to the microphone array, of the sound source corresponding to the target audio signal with respect to the position of the sound source corresponding to the audio signal received in the (k−1)-th time frame.
4. The method according to claim 3, wherein the determining the inverse of the target covariance matrix according to the Sherman-Morrison-Woodbury algorithm, the target frequency domain information, the forgetting factor corresponding to the k-th time frame, and the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame comprises:
determining the inverse of the target covariance matrix by:

$$R_{ASMW}^{-1}(k,f) = \frac{1}{\lambda_k}\left[R^{-1}(k-1,f) - \frac{R^{-1}(k-1,f)\,X(k,f)\,X^H(k,f)\,R^{-1}(k-1,f)}{\lambda_k + X^H(k,f)\,R^{-1}(k-1,f)\,X(k,f)}\right]$$

where R(k,f) represents the target covariance matrix, R_{ASMW}^{-1}(k,f) represents the inverse matrix of the target covariance matrix, ASMW denotes the Adaptive Sherman-Morrison-Woodbury algorithm, which is an adaptive Sherman-Morrison-Woodbury algorithm based on a forgetting factor, f represents the frequency point, λ_k represents the forgetting factor corresponding to the k-th time frame, R^{-1}(k-1,f) represents the inverse of the covariance matrix of the audio signal received by the microphone array in the (k−1)-th time frame, X(k,f) represents the target frequency domain information, and H represents the conjugate transpose of a matrix.
5. The method of claim 3, wherein k is a positive integer greater than or equal to 3, and the forgetting factor corresponding to the k-th time frame is determined by the following formula:

$$\lambda_k = \lambda_{k-1} - \mu\,\bigl|\theta_{k-1} - \theta_{k-2}\bigr|$$

where λ_k represents the forgetting factor corresponding to the k-th time frame, λ_{k−1} represents the forgetting factor corresponding to the (k−1)-th time frame, μ is a preset iteration step size, and |θ_{k−1} − θ_{k−2}| is the absolute value of the difference between the first angle and the second angle, wherein the first angle is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−1)-th time frame, and the second angle is the relative angle between the microphone array and the sound source of the audio signal received by the microphone array in the (k−2)-th time frame.
6. The method of claim 1, wherein the delay matrix corresponding to the predetermined angle is obtained according to an inner product of a steering vector corresponding to the predetermined angle and a conjugate transpose of the steering vector.
7. The method according to claim 1, wherein the determining the spatial spectrum corresponding to the preset angle according to the inverse matrix of the target covariance matrix and the time delay matrix corresponding to the preset angle comprises:
obtaining a first vector according to the inverse matrix of the target covariance matrix, wherein elements in the first vector comprise the sum of elements on a main diagonal of the inverse matrix of the target covariance matrix and the sum of elements on a diagonal parallel to the main diagonal in an upper triangular matrix of the inverse matrix of the target covariance matrix;
extracting a first row of a time delay matrix corresponding to the preset angle to obtain a second vector;
and determining the space spectrum corresponding to the preset angle according to the inner product of the first vector and the second vector.
8. The method according to claim 1, wherein the determining the target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle comprises:
determining a spatial spectrum maximum value in a spatial spectrum matrix corresponding to each preset angle;
and taking the preset angle corresponding to the maximum value of the spatial spectrum as the target angle.
9. A sound source localization apparatus, characterized in that the apparatus comprises:
a frequency domain information determination module, configured to determine target frequency domain information of a target audio signal received by the microphone array in the k-th time frame, wherein the microphone array is composed of a plurality of microphones arranged according to a preset spatial topology, and k is a positive integer;
an inverse matrix determining module, configured to determine an inverse matrix of a target covariance matrix of the target audio signal according to the target frequency domain information;
a spatial spectrum determining module, configured to determine, for each preset angle in a plurality of preset angles, a spatial spectrum corresponding to the preset angle according to an inverse matrix of the target covariance matrix and a delay matrix corresponding to the preset angle, where an element in the delay matrix corresponding to the preset angle includes frequency domain information of relative delay between audio signals received by each two microphones in the microphone array, the inverse matrix of the target covariance matrix is a self-conjugate matrix, the delay matrix corresponding to the preset angle is obtained by reconstructing a steering vector corresponding to the preset angle, and the delay matrix corresponding to the preset angle is a self-conjugate matrix and is a toeplitz matrix;
and the positioning module is used for determining a target angle from the plurality of preset angles according to the spatial spectrum corresponding to each preset angle and positioning a sound source corresponding to the target audio signal according to the target angle.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 8.
CN202110369681.9A 2021-04-07 2021-04-07 Sound source positioning method, sound source positioning device, storage medium and electronic equipment Active CN112799017B (en)

Publications (2)

Publication Number Publication Date
CN112799017A CN112799017A (en) 2021-05-14
CN112799017B true CN112799017B (en) 2021-07-09


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113687304A (en) * 2021-07-07 2021-11-23 浙江大华技术股份有限公司 Direct sound detection method, system and computer readable storage medium
CN113687305A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Method, device and equipment for positioning sound source azimuth and computer readable storage medium
CN113689869A (en) * 2021-07-26 2021-11-23 浙江大华技术股份有限公司 Speech enhancement method, electronic device, and computer-readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199029B (en) * 2014-09-03 2017-01-18 西安电子科技大学 Measurement matrix design method for improving target imaging performance of compressed sensing radar
CN107121669B (en) * 2016-02-25 2021-08-20 松下电器(美国)知识产权公司 Sound source detection device, sound source detection method, and non-transitory recording medium
CN110491403B (en) * 2018-11-30 2022-03-04 腾讯科技(深圳)有限公司 Audio signal processing method, device, medium and audio interaction equipment
CN109655783B (en) * 2018-12-26 2023-07-21 西安云脉智能技术有限公司 Method for estimating incoming wave direction of sensor array
CN109633538B (en) * 2019-01-22 2022-12-02 西安电子科技大学 Maximum likelihood time difference estimation method of non-uniform sampling system
CN110554357B (en) * 2019-09-12 2022-01-18 思必驰科技股份有限公司 Sound source positioning method and device
CN110501682B (en) * 2019-09-29 2021-07-27 北京润科通用技术有限公司 Method for measuring target azimuth angle by vehicle-mounted radar and vehicle-mounted radar

Also Published As

Publication number Publication date
CN112799017A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN112799017B (en) Sound source positioning method, sound source positioning device, storage medium and electronic equipment
US10123113B2 (en) Selective audio source enhancement
Erdogan et al. Improved MVDR beamforming using single-channel mask prediction networks
US11064294B1 (en) Multiple-source tracking and voice activity detections for planar microphone arrays
US9042573B2 (en) Processing signals
CN110610718B (en) Method and device for extracting expected sound source voice signal
KR20150115779A (en) Method and apparatus for determining directions of uncorrelated sound sources in a higher order ambisonics representation of a sound field
Ono Fast stereo independent vector analysis and its implementation on mobile phone
Brendel et al. Distributed source localization in acoustic sensor networks using the coherent-to-diffuse power ratio
JP2013536477A (en) Apparatus and method for resolving ambiguity from direction of arrival estimates
WO2016119388A1 (en) Method and device for constructing focus covariance matrix on the basis of voice signal
Ikeshita et al. Blind signal dereverberation based on mixture of weighted prediction error models
Luo et al. Implicit filter-and-sum network for multi-channel speech separation
Pan et al. On the design of target beampatterns for differential microphone arrays
US11902757B2 (en) Techniques for unified acoustic echo suppression using a recurrent neural network
Belloch et al. Real-time sound source localization on an embedded GPU using a spherical microphone array
CN113223552B (en) Speech enhancement method, device, apparatus, storage medium, and program
Loesch et al. On the robustness of the multidimensional state coherence transform for solving the permutation problem of frequency-domain ICA
Čmejla et al. Independent vector analysis exploiting pre-learned banks of relative transfer functions for assumed target’s positions
Chen et al. Sound source DOA estimation and localization in noisy reverberant environments using least-squares support vector machines
Wang et al. Speech separation and extraction by combining superdirective beamforming and blind source separation
Hioka et al. Estimating power spectral density for spatial audio signal separation: An effective approach for practical applications
JP7270869B2 (en) Information processing device, output method, and output program
Wakabayashi et al. Sound field interpolation for rotation-invariant multichannel array signal processing
Wang et al. Low-latency real-time independent vector analysis using convolutive transfer function

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant