CN111537955A - Multi-sound-source positioning method and device based on spherical microphone array

Info

Publication number: CN111537955A
Application number: CN202010255782.9A
Authority: CN
Inventors: 戴玮
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2020-08-14

Abstract

The invention discloses a multi-sound-source positioning method and device based on a spherical microphone array, wherein the multiple sound sources are D sound sources, and the method comprises the following steps: S1, acquiring a voice signal transmitted by the spherical microphone array; S2, performing framing, windowing and short-time Fourier transform processing on the voice signal to obtain a time-frequency domain voice signal; S3, performing spherical Fourier transform processing on the time-frequency domain voice signal to obtain a spherical harmonic domain voice signal; S4, constructing a sparse dictionary for the spherical harmonic domain voice signal according to the weights of the maximum directional beam former and a preset first grid precision; and S5, calculating first positions of the D sound sources in the sparse dictionary by using a sparse Bayesian learning method and an expectation maximization method. Using the weights of the maximum directional beam former as a new sparse dictionary improves the resolution of sparse positioning in the spherical harmonic domain, so that the method is suitable for situations with many sound sources at closely spaced positions and positions the sound sources more accurately.

Description

Multi-sound-source positioning method and device based on spherical microphone array

Technical Field

The invention relates to the technical field of voice signal processing, in particular to a multi-sound-source positioning method and device based on a spherical microphone array.

Background

At present, multi-sound-source sparse positioning with a spherical microphone array mainly obtains sound source position information in the spherical harmonic domain by using compressed sensing theory. The most common spherical harmonic domain sparse positioning method uses the weights of a delay-and-sum beam former as its sparse dictionary. When the number of sound sources is large and their positions are closely spaced, this method positions the sound sources inaccurately and its resolution is poor.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the art described above. Therefore, a first objective of the present invention is to provide a multi-sound-source positioning method based on a spherical microphone array, in which the weights of a maximum directional beam former are used as a new sparse dictionary to improve the resolution of sparse positioning in the spherical harmonic domain, so that the method is suitable for situations with many sound sources at closely spaced positions.

The second purpose of the invention is to provide a multi-sound-source positioning device based on a spherical microphone array.

In order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for positioning multiple sound sources based on a spherical microphone array, where the multiple sound sources are D sound sources, and the method includes:

S1, acquiring a voice signal transmitted by the spherical microphone array;

S2, performing framing, windowing and short-time Fourier transform processing on the voice signal to obtain a time-frequency domain voice signal;

S3, performing spherical Fourier transform processing on the time-frequency domain voice signal to obtain a spherical harmonic domain voice signal;

S4, constructing a sparse dictionary for the spherical harmonic domain voice signal according to the weights of the maximum directional beam former and a preset first grid precision;

and S5, calculating the first positions of the D sound sources in the sparse dictionary by using a sparse Bayesian learning method and an expectation maximization method.
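
As an illustration of step S2 only, the sketch below frames one microphone channel, applies a window to each frame and takes the FFT of each frame. It is a minimal sketch rather than the patented implementation: the function name `stft_frames`, the Hann window, the frame length of 512 samples and the hop of 256 samples are all assumptions, since the text does not specify them.

```python
import numpy as np


def stft_frames(x, frame_len=512, hop=256):
    """Frame a 1-D signal, window each frame, and return its one-sided FFT.

    frame_len, hop and the Hann window are illustrative assumptions.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    x = np.asarray(x, dtype=float)
    if x.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (x.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Each row is one windowed frame of the input signal.
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One-sided FFT per frame gives the time-frequency representation.
    return np.fft.rfft(frames, axis=1)
```

For an array with several microphones, the same function would be applied to each channel, giving one time-frequency matrix per microphone.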

According to the multi-sound-source positioning method based on the spherical microphone array provided by the embodiment of the first aspect of the invention, the three-dimensional spatial characteristics of the spherical microphone array are used to sample the voice signal over all azimuth and pitch angles, so that the voice signals emitted from the D sound source positions are acquired. Framing, windowing and short-time Fourier transform processing are performed on the voice signal to obtain a time-frequency domain voice signal, which improves the processing efficiency of the voice signal. Spherical Fourier transform processing then converts the signal into the spherical harmonic domain. A sparse dictionary is constructed from the weights of the maximum directional beam former according to the first grid precision, where the first grid precision is (10°, 5°), 10° being the azimuth interval angle and 5° the pitch interval angle. Within the sparse dictionary, the first positions of the D sound sources are calculated by using a sparse Bayesian learning method and an expectation maximization method. A sparse dictionary designed from the weights of the maximum directional beam former has high resolution, yields accurate sound source position information, and is broadly applicable to positioning multiple sound sources.
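
The spherical Fourier transform step can be illustrated with a small discrete version, in which each coefficient is a weighted sum of the microphone pressures multiplied by the conjugate spherical harmonic. The sketch below is a toy example under stated assumptions, not the patent's implementation: it is limited to order 1 with closed-form harmonics, it treats θ as the polar angle measured from the z axis, and the names `sph_harm_order1` and `spherical_fourier_transform` and the per-microphone quadrature weights are hypothetical.

```python
import numpy as np


def sph_harm_order1(theta, phi):
    """Complex spherical harmonics up to order 1.

    theta is the polar angle from the z axis, phi the azimuth (radians).
    Columns are ordered (n, m) = (0, 0), (1, -1), (1, 0), (1, 1).
    """
    theta = np.asarray(theta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    y00 = np.full(theta.shape, 0.5 / np.sqrt(np.pi), dtype=complex)
    y1m1 = np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(theta) * np.exp(-1j * phi)
    y10 = np.sqrt(3.0 / (4.0 * np.pi)) * np.cos(theta) + 0j
    y11 = -np.sqrt(3.0 / (8.0 * np.pi)) * np.sin(theta) * np.exp(1j * phi)
    return np.stack([y00, y1m1, y10, y11], axis=1)


def spherical_fourier_transform(p, theta, phi, weights):
    """Discrete spherical Fourier transform of one time-frequency bin.

    p holds the pressure at each microphone, weights the assumed
    quadrature weight of each microphone position.
    """
    Y = sph_harm_order1(theta, phi)  # shape (Q, 4)
    return Y.conj().T @ (np.asarray(weights) * np.asarray(p))
```

With six microphones at the vertices of an octahedron and equal weights of 4π/6, a pressure pattern equal to the (1, 0) harmonic transforms to a single unit coefficient, which is a convenient sanity check.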

According to some embodiments of the present invention, obtaining the first positions of the D sound sources in the sparse dictionary by applying a sparse Bayesian learning method and an expectation maximization method includes:

S51, setting a first parameter value of sparse Bayesian learning and a first preset iteration number N, and calculating a first mean value and a first covariance of a sparse matrix for the sparse dictionary according to the first parameter value of the sparse Bayesian learning;

S52, estimating the first parameter value of the sparse Bayesian learning according to the first mean value and the first covariance by using an expectation maximization method to obtain a second parameter value; calculating a second mean value and a second covariance of the sparse matrix for the sparse dictionary according to the second parameter value of the sparse Bayesian learning, which completes one iteration; and when the current iteration number is determined to be equal to the first preset iteration number N, stopping iteration, calculating the Nth mean value, and taking the first D highest peaks of the energy spectrum of the Nth mean value as the first positions of the D sound sources.
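
The alternation between computing the posterior mean and covariance and re-estimating the parameters can be sketched with the standard sparse Bayesian learning EM updates. The text does not give the update equations, so everything below is an assumption made for illustration: a linear model y = A x + noise with a fixed, known noise variance, one variance hyperparameter per dictionary atom, and the hypothetical names `sbl_em` and `top_d_atoms`. For brevity, `top_d_atoms` returns the D atoms with the largest energy rather than performing true peak detection on the grid.

```python
import numpy as np


def sbl_em(A, y, noise_var=1e-2, n_iter=100):
    """Sparse Bayesian learning with EM hyperparameter updates.

    A: dictionary of shape (n_obs, n_atoms); y: observation of shape (n_obs,).
    Returns the posterior mean and the per-atom variances gamma.
    """
    A = np.asarray(A, dtype=complex)
    y = np.asarray(y, dtype=complex)
    n_obs, n_atoms = A.shape
    gamma = np.ones(n_atoms)  # initial parameter values
    mu = np.zeros(n_atoms, dtype=complex)
    for _ in range(n_iter):
        # E-step: posterior mean and diagonal of the posterior covariance.
        AG = A * gamma  # equals A @ diag(gamma)
        sigma_y = noise_var * np.eye(n_obs) + AG @ A.conj().T
        K = np.linalg.solve(sigma_y, AG).conj().T  # diag(gamma) A^H sigma_y^-1
        mu = K @ y
        sigma_diag = gamma - np.real(np.sum(K * AG.T, axis=1))
        # M-step: re-estimate each atom's variance; floor keeps it positive.
        gamma = np.maximum(np.abs(mu) ** 2 + sigma_diag, 1e-12)
    return mu, gamma


def top_d_atoms(mu, d):
    """Indices of the d atoms with the largest energy |mu|^2."""
    energy = np.abs(mu) ** 2
    return np.argsort(energy)[::-1][:d].tolist()
```

A fixed iteration count is used as the stopping rule, matching the preset iteration number N described above.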

According to some embodiments of the invention, after obtaining the first positions of the D sound sources, further comprising:

S6, setting a second preset iteration number M;

S7, performing mesh refinement in a preset area of the first positions of the D sound sources through a first preset rule;

S8, after the sparse dictionary is reconstructed according to the grid precision calculated by the first preset rule, obtaining second positions of the D sound sources by applying a sparse Bayesian learning method and an expectation maximization method, which completes one iteration of grid refinement; and when the number of grid refinement iterations is determined to be equal to the second preset iteration number M, stopping iteration, calculating a final mean value, and taking the first D highest peaks of the energy spectrum of the mean value as the Mth positions of the D sound sources.
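
The refinement in steps S6 to S8 can be pictured as laying a finer local grid around each current estimate. The formula of the first preset rule is not reproduced in this text, so the sketch below does not implement it. It is a hypothetical stand-in in which `refine_grid`, the half-width parameters, the pitch range of 0° to 180° and the azimuth wrap at 360° are all assumptions.

```python
def refine_grid(positions, az_step, pitch_step, az_halfwidth, pitch_halfwidth):
    """Build a local grid of (pitch, azimuth) points around each position.

    positions: iterable of (pitch_deg, azimuth_deg) estimates.
    Steps and half-widths are in degrees. Pitch is clipped to [0, 180]
    and azimuth wraps modulo 360; both conventions are assumptions.
    """
    n_p = int(round(pitch_halfwidth / pitch_step))
    n_a = int(round(az_halfwidth / az_step))
    points = set()
    for pitch0, az0 in positions:
        for i in range(-n_p, n_p + 1):
            pitch = pitch0 + i * pitch_step
            if not 0 <= pitch <= 180:
                continue  # skip points outside the valid pitch range
            for j in range(-n_a, n_a + 1):
                az = (az0 + j * az_step) % 360
                points.add((round(pitch, 6), round(az, 6)))
    return sorted(points)
```

Each refinement iteration would rebuild the dictionary on the returned points and rerun the sparse Bayesian learning step, so only a small neighbourhood is evaluated at the finer precision.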

According to some embodiments of the invention, the first preset rule comprises calculating a grid precision by the following formula:

[formula omitted in source]

where Θ^(j) denotes the grid set Θ in the j-th iteration; the two step symbols of the formula denote the interval angles of the pitch angle and the azimuth angle, respectively, in the j-th iteration; θ denotes the pitch angle; φ denotes the azimuth angle; φ_θ denotes the pitch angle in the j-th iteration; and φ_φ denotes the azimuth angle in the j-th iteration.

According to some embodiments of the invention, the weights of the maximum directional beamformer are:

[formula omitted in source]

where b_n(kr) denotes the mode strength; the spherical harmonics are of order n and degree m; r denotes the radius of the spherical microphone array; θ denotes the pitch angle; and φ denotes the azimuth angle.
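
For orientation, a commonly used textbook form of the maximum-directivity beamformer weights in the spherical harmonic domain is given below. It is offered as an assumption and is not confirmed to be the formula of this patent: the normalization, the conjugation convention, the maximum order N, the wavenumber k and the look direction (θ_l, φ_l) are not taken from the text.

```latex
w_{nm}^{*}(k) = \frac{d_n}{b_n(kr)}\, Y_n^m(\theta_l, \phi_l),
\qquad d_n = \frac{4\pi}{(N+1)^2},
\qquad 0 \le n \le N,\; -n \le m \le n
```

In this form each order is divided by the mode strength b_n(kr), and the constant d_n normalizes the response to one in the look direction.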

According to some embodiments of the invention, the first grid precision is (10 °, 5 °), wherein 10 ° is an azimuth interval angle and 5 ° is a pitch interval angle.
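
To make the first grid precision concrete, the sketch below enumerates the candidate directions of a (10°, 5°) grid. The angular ranges are assumptions, since the text gives only the interval angles: azimuth is taken over [0°, 360°) and pitch over [0°, 180°], and `build_grid` is a hypothetical helper name.

```python
def build_grid(az_step_deg=10.0, pitch_step_deg=5.0):
    """Return all (pitch, azimuth) grid points in degrees.

    Azimuth covers [0, 360) and pitch covers [0, 180]; both ranges
    are assumptions made for illustration.
    """
    n_az = int(round(360.0 / az_step_deg))
    n_pitch = int(round(180.0 / pitch_step_deg)) + 1
    azimuths = [i * az_step_deg for i in range(n_az)]
    pitches = [j * pitch_step_deg for j in range(n_pitch)]
    return [(p, a) for p in pitches for a in azimuths]
```

Under these assumptions the grid has 36 azimuths and 37 pitch angles, so the sparse dictionary would have 1332 atoms, one per candidate direction.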

In order to achieve the above object, an embodiment of a second aspect of the present invention provides a multi-sound-source positioning device based on a spherical microphone array, where the multiple sound sources are D sound sources, and the device includes:

the acquisition module is used for acquiring the voice signal transmitted by the spherical microphone array;

the first signal processing module is used for performing frame windowing and short-time Fourier transform processing on the voice signal to obtain a time-frequency domain voice signal;

the second signal processing module is used for carrying out spherical Fourier transform processing on the time-frequency domain voice signal to obtain a spherical harmonic domain voice signal;

the sparse dictionary building module is used for building a sparse dictionary for the spherical harmonic domain voice signals according to the weight of the maximum directional beam former and the preset first grid precision;

and the first calculation module is used for calculating and obtaining first positions of the D sound sources in the sparse dictionary by applying a sparse Bayesian learning method and an expectation maximization method.

According to the multi-sound-source positioning device based on the spherical microphone array provided by the embodiment of the second aspect of the invention, the three-dimensional spatial characteristics of the spherical microphone array are used to sample the voice signal over all azimuth and pitch angles, so that the voice signals emitted from the D sound source positions are acquired. Framing, windowing and short-time Fourier transform processing are performed on the voice signal to obtain a time-frequency domain voice signal, which improves the processing efficiency of the voice signal. Spherical Fourier transform processing then converts the signal into the spherical harmonic domain. A sparse dictionary is constructed from the weights of the maximum directional beam former according to the first grid precision, where the first grid precision is (10°, 5°), 10° being the azimuth interval angle and 5° the pitch interval angle. Within the sparse dictionary, the first positions of the D sound sources are calculated by using a sparse Bayesian learning method and an expectation maximization method. A sparse dictionary designed from the weights of the maximum directional beam former has high resolution, yields accurate sound source position information, and is broadly applicable to positioning multiple sound sources.

According to some embodiments of the invention, the multi-sound-source positioning device based on the spherical microphone array further comprises:

the first storage module is used for storing a first parameter value of sparse Bayesian learning and a first preset iteration number N;

the first calculation module is used for calculating a first mean value and a first covariance of a sparse matrix for the sparse dictionary according to the first parameter value of sparse Bayesian learning; estimating the first parameter value of the sparse Bayesian learning according to the first mean value and the first covariance by using an expectation maximization method to obtain a second parameter value; calculating a second mean value and a second covariance of the sparse matrix for the sparse dictionary according to the second parameter value of the sparse Bayesian learning, which completes one iteration; and when the current iteration number is determined to be equal to the first preset iteration number N, stopping iteration, calculating the Nth mean value, and taking the first D highest peaks of the energy spectrum of the Nth mean value as the first positions of the D sound sources;

the second storage module is used for storing a second preset iteration number M;

the second calculation module is used for carrying out grid refinement in a preset area of the first positions of the D sound sources through a first preset rule; after a sparse dictionary is reconstructed according to the grid precision calculated by the first preset rule, second positions of the D sound sources are obtained by applying a sparse Bayesian learning method and an expectation maximization method, and one iteration of grid refinement is completed; and when the iteration number of the current grid refinement is determined to be equal to a second preset iteration number M, stopping iteration, calculating to obtain a final mean value, and taking the first D highest peaks of the energy spectrum of the mean value as the Mth positions of the D sound sources.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of a multi-sound-source positioning method based on a spherical microphone array according to one embodiment of the present invention;

FIG. 2 is a flow chart of a method for positioning multiple sound sources from the sparse dictionary according to one embodiment of the present invention;

FIG. 3 is a flow chart of a multi-sound-source positioning method based on a spherical microphone array according to yet another embodiment of the present invention;

FIG. 4 is a block diagram of a multi-sound-source positioning device based on a spherical microphone array according to a first embodiment of the present invention;

FIG. 5 is a block diagram of a multi-sound-source positioning device based on a spherical microphone array according to a second embodiment of the present invention;

FIG. 6 is a block diagram of a multi-sound-source positioning device based on a spherical microphone array according to a third embodiment of the present invention.

Reference numerals:

multi-sound-source positioning device 100 based on a spherical microphone array; acquisition module 1; first signal processing module 2; second signal processing module 3; sparse dictionary construction module 4; first calculation module 5; first storage module 6; second storage module 7; second calculation module 8.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

A multi-sound-source positioning method and device based on a spherical microphone array according to an embodiment of the present invention will be described with reference to FIGS. 1 to 6.

FIG. 1 is a flow chart of a method for multi-source localization based on a spherical microphone array according to one embodiment of the present invention; as shown in fig. 1, an embodiment of the first aspect of the present invention provides a method for positioning multiple sound sources based on a spherical microphone array, where the multiple sound sources are D sound sources, and the method includes:

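S1, acquiring a voice signal transmitted by the spherical microphone array;

S2, performing framing, windowing and short-time Fourier transform processing on the voice signal to obtain a time-frequency domain voice signal;

S3, performing spherical Fourier transform processing on the time-frequency domain voice signal to obtain a spherical harmonic domain voice signal;

S4, constructing a sparse dictionary for the spherical harmonic domain voice signal according to the weights of the maximum directional beam former and a preset first grid precision;

and S5, calculating the first positions of the D sound sources in the sparse dictionary by using a sparse Bayesian learning method and an expectation maximization method.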

According to the multi-sound-source positioning method based on the spherical microphone array provided by the first embodiment of the invention, the three-dimensional spatial characteristics of the spherical microphone array are used to sample the voice signal over all azimuth and pitch angles, so that the voice signals emitted from the D sound source positions are acquired. Framing, windowing and short-time Fourier transform processing are performed on the voice signal to obtain a time-frequency domain voice signal, which improves the processing efficiency of the voice signal. Spherical Fourier transform processing then converts the signal into the spherical harmonic domain. A sparse dictionary is constructed from the weights of the maximum directional beam former according to the first grid precision, where the first grid precision is (10°, 5°), 10° being the azimuth interval angle and 5° the pitch interval angle. Within the sparse dictionary, the first positions of the D sound sources are calculated by using a sparse Bayesian learning method and an expectation maximization method. A sparse dictionary designed from the weights of the maximum directional beam former has high resolution, yields accurate sound source position information, and is broadly applicable to positioning multiple sound sources.

FIG. 2 is a flow chart of a method for positioning multiple sound sources from the sparse dictionary according to one embodiment of the present invention; as shown in FIG. 2, according to some embodiments of the present invention, obtaining the first positions of the D sound sources in the sparse dictionary by using a sparse Bayesian learning method and an expectation maximization method includes:

The working principle and the beneficial effects of the technical scheme are as follows: a first parameter value of the sparse Bayesian learning, namely the initial parameter value, is set. A first mean value and a first covariance of the sparse matrix are calculated for the sparse dictionary according to the first parameter value, and the first parameter value is then re-estimated from the first mean value and the first covariance by the expectation maximization method to obtain a second parameter value. A second mean value and a second covariance of the sparse matrix are calculated from the second parameter value, which completes one iteration. Iteration stops when the convergence condition is met, namely when the current iteration number equals the first preset iteration number N, where N may be 1000. The Nth mean value is then calculated, and the first D highest peaks of its energy spectrum are taken as the first positions of the D sound sources, which improves the positioning accuracy of the sound source positions.
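
Taking the first D highest peaks differs from taking the D largest values, because neighbouring grid points of one strong source can outrank a weaker second source. The sketch below picks local maxima of a one-dimensional energy sequence and is only an illustration: the function name `top_d_peaks` and the reduction of the two-dimensional angle grid to a single index are assumptions.

```python
def top_d_peaks(energy, d):
    """Return the indices of the d highest local maxima of a 1-D sequence.

    A point is a peak if it exceeds its left neighbour and is at least
    as large as its right neighbour; sequence ends count as -infinity.
    """
    n = len(energy)
    peaks = []
    for i in range(n):
        left = energy[i - 1] if i > 0 else float("-inf")
        right = energy[i + 1] if i < n - 1 else float("-inf")
        if energy[i] > left and energy[i] >= right:
            peaks.append(i)
    # Strongest peaks first, then keep the first d of them.
    peaks.sort(key=lambda i: energy[i], reverse=True)
    return peaks[:d]
```

For the sequence `[0.1, 0.9, 0.8, 0.2, 0.7, 0.1, 0.3]` and D = 2, the function returns indices 1 and 4. Index 2 holds the second-largest value but lies on the shoulder of the first peak, so it is not selected.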

FIG. 3 is a flow chart of a method for multi-source localization based on a spherical microphone array according to yet another embodiment of the present invention; as shown in fig. 3, after obtaining the first positions of the D sound sources, the method further includes:

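S6, setting a second preset iteration number M;

S7, performing mesh refinement in a preset area of the first positions of the D sound sources through a first preset rule;

S8, after the sparse dictionary is reconstructed according to the grid precision calculated by the first preset rule, obtaining second positions of the D sound sources by applying a sparse Bayesian learning method and an expectation maximization method, which completes one iteration of grid refinement; and when the number of grid refinement iterations is determined to be equal to the second preset iteration number M, stopping iteration, calculating a final mean value, and taking the first D highest peaks of the energy spectrum of the mean value as the Mth positions of the D sound sources.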

The working principle and the beneficial effects of the technical scheme are as follows: after the first positions of the D sound sources are obtained, mesh refinement is performed in a preset area of the first positions in order to obtain more accurate sound source position information, the preset area being the area near each sound source position. A second preset iteration number M is set in order to obtain the sound source positions with a preset precision. Mesh refinement is performed in the preset area of the first positions of the D sound sources through a first preset rule, and the sparse dictionary is reconstructed according to the grid precision calculated by the first preset rule, which is finer than the previous grid precision. Illustratively, the first grid precision is (10°, 5°), where 10° is the azimuth interval angle and 5° is the pitch interval angle, and the grid precision calculated by the first preset rule may be (5°, 2°). The sparse dictionary is reconstructed according to this grid precision and the weights of the maximum directional beam former, and the sound source positions are recalculated by the above method to obtain the second positions of the D sound sources, which completes one iteration of grid refinement. Grid refinement then continues in the area near the second positions: the sparse dictionary is reconstructed again according to the grid precision recalculated by the first preset rule, which may be (4°, 1°), and the weights of the maximum directional beam former, and the sound source positions are recalculated by the above method to obtain the third positions of the D sound sources. Iteration stops when the number of grid refinement iterations equals the second preset iteration number M. The final mean value is then calculated, and the first D highest peaks of its energy spectrum are taken as the Mth positions of the D sound sources, which gives the sound source positions with the preset precision and positions them more accurately. The second preset iteration number is 3, so that sound source position information with the preset precision is obtained while the calculation amount and computational complexity remain below a preset threshold.
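
The computational motivation for refining only near the estimates can be seen by counting how many directions a full-sphere grid would contain at each of the illustrative precisions. The sketch below is simple arithmetic under assumed angular ranges (azimuth over [0°, 360°), pitch over [0°, 180°]); `global_grid_size` is a hypothetical helper name.

```python
def global_grid_size(az_step_deg, pitch_step_deg):
    """Number of (pitch, azimuth) points on a full-sphere grid.

    Assumes azimuth in [0, 360) and pitch in [0, 180], endpoints included
    for pitch only.
    """
    n_az = int(round(360.0 / az_step_deg))
    n_pitch = int(round(180.0 / pitch_step_deg)) + 1
    return n_az * n_pitch


# Full-sphere dictionary sizes for the illustrative precisions in the text.
sizes = {step: global_grid_size(*step) for step in [(10, 5), (5, 2), (4, 1)]}
```

Under these assumptions a full grid grows from 1332 points at (10°, 5°) to 6552 at (5°, 2°) and 16290 at (4°, 1°), whereas a local refinement evaluates only the points near the D current estimates.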

According to some embodiments of the invention, the first preset rule comprises calculating a grid precision by a formula in which Θ^(j) denotes the grid set Θ in the j-th iteration.

According to some embodiments of the invention, in the weights of the maximum directional beam former, b_n(kr) denotes the mode strength, the spherical harmonics are of order n and degree m, r denotes the radius of the spherical microphone array, θ denotes the pitch angle, and φ denotes the azimuth angle. A sparse dictionary designed from the weights of the maximum directional beam former has high resolution, yields accurate sound source position information, and is broadly applicable to positioning multiple sound sources.

According to some embodiments of the invention, the first grid precision is (10°, 5°), wherein 10° is an azimuth interval angle and 5° is a pitch interval angle. With this first grid precision, the calculated first positions of the D sound sources meet the positioning precision requirement while the calculation amount is reduced.

FIG. 4 is a block diagram of a multi-sound-source positioning device 100 based on a spherical microphone array according to a first embodiment of the present invention; as shown in FIG. 4, an embodiment of the second aspect of the present invention provides a multi-sound-source positioning device 100 based on a spherical microphone array, where the multiple sound sources are D sound sources, and the device includes:

the acquisition module 1 is used for acquiring the voice signal transmitted by the spherical microphone array;

the first signal processing module 2 is configured to perform framing, windowing and short-time Fourier transform processing on the voice signal to obtain a time-frequency domain voice signal;

the second signal processing module 3 is configured to perform spherical Fourier transform processing on the time-frequency domain voice signal to obtain a spherical harmonic domain voice signal;

the sparse dictionary construction module 4 is used for constructing a sparse dictionary for the spherical harmonic domain voice signal according to the weight of the maximum directional beam former and the preset first grid precision;

and the first calculation module 5 is used for calculating and obtaining first positions of the D sound sources in the sparse dictionary by applying a sparse Bayesian learning method and an expectation maximization method.

The multi-sound-source positioning device 100 based on the spherical microphone array according to the embodiment of the second aspect of the present invention uses the three-dimensional spatial characteristics of the spherical microphone array to sample the voice signal over all azimuth and pitch angles, so that the voice signals emitted from the D sound source positions are acquired. Framing, windowing and short-time Fourier transform processing are performed on the voice signal to obtain a time-frequency domain voice signal, which improves the processing efficiency of the voice signal. Spherical Fourier transform processing then converts the signal into the spherical harmonic domain. A sparse dictionary is constructed from the weights of the maximum directional beam former according to the first grid precision, where the first grid precision is (10°, 5°), 10° being the azimuth interval angle and 5° the pitch interval angle. Within the sparse dictionary, the first positions of the D sound sources are calculated by using a sparse Bayesian learning method and an expectation maximization method. A sparse dictionary designed from the weights of the maximum directional beam former has high resolution, yields accurate sound source position information, and is broadly applicable to positioning multiple sound sources.

FIG. 5 is a block diagram of a multi-sound-source positioning device 100 based on a spherical microphone array according to a second embodiment of the present invention; as shown in FIG. 5, the device further includes:

the first storage module 6 is used for storing a first parameter value of sparse Bayesian learning and a first preset iteration number N;

the first calculation module 5 is configured to calculate a first mean value and a first covariance of a sparse matrix for the sparse dictionary according to the first parameter value of sparse Bayesian learning; estimate the first parameter value of the sparse Bayesian learning according to the first mean value and the first covariance by using an expectation maximization method to obtain a second parameter value; calculate a second mean value and a second covariance of the sparse matrix for the sparse dictionary according to the second parameter value of the sparse Bayesian learning, which completes one iteration; and when the current iteration number is determined to be equal to the first preset iteration number N, stop iteration, calculate the Nth mean value, and take the first D highest peaks of the energy spectrum of the Nth mean value as the first positions of the D sound sources.

FIG. 6 is a block diagram of a multi-sound-source positioning device 100 based on a spherical microphone array according to a third embodiment of the present invention; as shown in FIG. 6, the device further includes:

a second storage module 7, configured to store a second preset iteration number M;

the second calculation module 8 is used for performing mesh refinement in a preset area of the first positions of the D sound sources through a first preset rule; after a sparse dictionary is reconstructed according to the grid precision calculated by the first preset rule, second positions of the D sound sources are obtained by applying a sparse Bayesian learning method and an expectation maximization method, and one iteration of grid refinement is completed; and when the iteration number of the current grid refinement is determined to be equal to a second preset iteration number M, stopping iteration, calculating to obtain a final mean value, and taking the first D highest peaks of the energy spectrum of the mean value as the Mth positions of the D sound sources.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A multi-sound-source positioning method based on a spherical microphone array, characterized in that the multiple sound sources are D sound sources, and the method comprises the following steps:

S1, acquiring a voice signal transmitted by the spherical microphone array;

2. The method as claimed in claim 1, wherein obtaining the first positions of the D sound sources in the sparse dictionary by using a sparse Bayesian learning method and an expectation maximization method comprises:

3. The method of claim 1, wherein after obtaining the first positions of the D sound sources, the method further comprises:

S6, setting a second preset iteration number M;

4. The multi-sound-source positioning method based on a spherical microphone array of claim 3, wherein the first preset rule comprises calculating a grid precision by the following formula:

[formula omitted in source]

where Θ^(j) denotes the grid set Θ in the j-th iteration; the two step symbols of the formula denote the interval angles of the pitch angle and the azimuth angle, respectively, in the j-th iteration; θ denotes the pitch angle; φ denotes the azimuth angle; φ_θ denotes the pitch angle in the j-th iteration; and φ_φ denotes the azimuth angle in the j-th iteration.

5. The method of claim 1, wherein the weights of the maximum directional beamformer are:

wherein b_n(kr) denotes the mode strength,

6. The multi-sound-source positioning method based on a spherical microphone array of claim 5, wherein the first grid precision is (10°, 5°), wherein 10° is an azimuth interval angle and 5° is a pitch interval angle.

7. A multi-sound-source positioning device based on a spherical microphone array, wherein the multiple sound sources are D sound sources, and the device comprises:

8. The multi-sound-source positioning device based on a spherical microphone array of claim 7, further comprising:

9. The multi-sound-source positioning device based on a spherical microphone array of claim 7, further comprising: