CN113053406A - Sound signal identification method and device - Google Patents

Sound signal identification method and device

Info

Publication number
CN113053406A
Authority
CN
China
Prior art keywords
sound source
signal
data
observation
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110502126.9A
Other languages
Chinese (zh)
Inventor
侯海宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202110502126.9A priority Critical patent/CN113053406A/en
Publication of CN113053406A publication Critical patent/CN113053406A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0224 Processing in the time domain
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Abstract

The disclosure relates to a sound signal identification method and device. The method relates to intelligent voice interaction technology and solves the problem of low sound source signal identification accuracy in scenarios with strong interference and a low signal-to-noise ratio. The method comprises the following steps: acquiring original observation data collected by at least two acquisition points for at least two sound sources; performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data; obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data; performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal; and obtaining, according to the beam enhancement output signal, a time domain sound source signal with an enhanced signal-to-noise ratio for each sound source. The technical scheme provided by the disclosure is suitable for voice interaction equipment and realizes high-quality, low-noise voice signal identification.

Description

Sound signal identification method and device
Technical Field
The present disclosure relates to intelligent voice interaction technologies, and in particular, to a method and an apparatus for recognizing a voice signal.
Background
In the era of the Internet of Things and AI, intelligent voice, as one of the core technologies of artificial intelligence, has enriched human-machine interaction modes and greatly improved the convenience of using intelligent products.
Intelligent product devices mainly use a microphone array composed of multiple microphones for sound pickup, and apply microphone beamforming or blind source separation technology to suppress environmental interference and improve the processing quality of voice signals, thereby improving the voice recognition rate in real environments.
Microphone beamforming technology needs to estimate the sound source direction. In addition, to give a device stronger intelligence and perception, an intelligent device is usually equipped with an indicator light; when, during interaction, the indicator light accurately points to the user rather than to the interference, the user feels as if in a face-to-face conversation with the intelligent device, which enhances the interaction experience. For this reason, in an environment with interfering sound sources, it is important to accurately estimate the direction of the user (i.e., the sound source).
Generally, a sound source direction-finding algorithm directly uses the data collected by the microphones and performs direction estimation with algorithms such as Steered Response Power with Phase Transform (SRP-PHAT). However, such an algorithm depends on the signal-to-noise ratio of the signal: its accuracy is not high enough at a low signal-to-noise ratio, and the effective sound source cannot be accurately located when interfering sound sources are present in various directions.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a method and an apparatus for recognizing a voice signal.
According to a first aspect of the embodiments of the present disclosure, there is provided a sound signal identification method, including:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
Further, the step of performing a first-stage noise reduction process on the original observation data to obtain observation signal estimation data includes:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix is the number of the sound sources;
obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix;
and separating the original observation data according to the deblurred separation matrix, and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
Further, the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
and separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
Further, the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:
and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
Further, the step of updating the separation matrix according to the updated weighted covariance matrix includes:
respectively updating the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and updating the separation matrix to be a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
Further, the step of deblurring the updated separation matrix comprises:
and carrying out amplitude deblurring processing on the separation matrix by adopting a minimum distortion criterion.
Further, the step of obtaining the positioning information and the observed signal data of each sound source according to the observed signal estimation data includes:
obtaining the observation signal data of each sound source at each acquisition point according to the observation signal estimation data;
and respectively estimating the direction of each sound source according to the observation signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
Further, the step of respectively estimating the azimuth of each sound source according to the observation signal data of each sound source at each collection point to obtain the positioning information of each sound source comprises:
and respectively estimating each sound source as follows to obtain the azimuth of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
Further, the positioning information of each sound source includes the azimuth coordinates of the sound source, and the step of performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain the beam enhancement output signal includes:
respectively calculating the propagation delay difference value of each sound source according to the azimuth coordinate of each sound source and the azimuth coordinate of each acquisition point, wherein the propagation delay difference value is the time difference value of transmitting the sound emitted by the sound source to each acquisition point;
and respectively carrying out secondary noise reduction on each sound source through beam delay summation beam forming processing by using the observation signal data of each sound source to obtain the beam enhanced output signal of each sound source.
Further, the step of obtaining a time-domain sound source signal with an enhanced signal-to-noise ratio of each sound source according to the beam enhancement output signal includes:
and performing short-time Fourier inverse transformation on the beam enhanced output signals of each sound source, and then overlapping and adding to obtain time-domain sound source signals with enhanced signal-to-noise ratios of each sound source.
According to a second aspect of the embodiments of the present disclosure, there is provided a sound signal recognition apparatus including:
the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points on at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
the positioning module is used for obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
the second noise reduction module is used for carrying out second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and the enhanced signal output module is used for obtaining, according to the beam enhancement output signal, a time domain sound source signal with an enhanced signal-to-noise ratio for each sound source.
Further, the first noise reduction module includes:
the matrix initialization submodule is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, and the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the frequency domain data acquisition submodule is used for solving time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;
the prior frequency domain estimation submodule is used for solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix update submodule for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule for updating the separation matrix according to the updated weighted covariance matrix;
a deblurring submodule for deblurring the updated separation matrix;
and the posterior domain estimation submodule is used for separating the original observation data according to the deblurred separation matrix and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
Further, the priori frequency domain estimation submodule is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a priori frequency domain estimation of each sound source of the current frame.
Further, the covariance matrix update sub-module is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.
Further, the separation matrix update sub-module includes:
the first updating submodule is used for respectively updating the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and the second updating submodule is used for updating the separation matrix into a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
Further, the deblurring submodule is configured to perform amplitude deblurring processing on the separation matrix by using a minimum distortion criterion.
Further, the positioning module comprises:
the sound source data estimation submodule is used for obtaining the observation signal data of each sound source at each acquisition point according to the observation signal estimation data;
and the first positioning submodule is used for respectively estimating the position of each sound source according to the observation signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
Further, the first positioning sub-module is configured to estimate each sound source as follows, and obtain the position of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
Further, the positioning information of the sound source includes an azimuth coordinate of the sound source, and the second noise reduction module includes:
the time delay calculation submodule is used for respectively calculating the propagation time delay difference value of each sound source according to the azimuth coordinate of each sound source and the azimuth coordinate of each acquisition point, and the propagation time delay difference value is the time difference value of transmitting the sound emitted by the sound source to each acquisition point;
a beam summation submodule for performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing by using the observation signal data of each sound source to obtain the beam enhanced output signal of each sound source.
Further, the enhanced signal output module is configured to perform an inverse short-time Fourier transform on the beam enhanced output signals of each sound source and then overlap-add them to obtain a time-domain sound source signal with an enhanced signal-to-noise ratio for each sound source.
According to a third aspect of embodiments of the present disclosure, there is provided a computer apparatus comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer-readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects: the method comprises the steps of obtaining original observation data acquired by at least two acquisition points for at least two sound sources respectively, then carrying out primary noise reduction processing on the original observation data to obtain observation signal estimation data, then obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data, then carrying out secondary noise reduction processing on the observation signal data according to the positioning information to obtain beam enhancement output signals, and obtaining time domain sound source signals with enhanced signal-to-noise ratios of the sound sources according to the beam enhancement output signals. After the original observation data is subjected to noise reduction processing to position the sound source, the signal-to-noise ratio is further improved through beam enhancement to highlight the signal, the problems of low sound source positioning accuracy and poor voice recognition quality under the scene of strong interference and low signal-to-noise ratio are solved, and high-efficiency and high-interference-resistance voice signal recognition is realized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a flow chart illustrating a method of sound signal recognition according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating yet another method of sound signal identification according to an example embodiment.
Fig. 3 is a flow chart illustrating yet another method of sound signal identification according to an example embodiment.
Fig. 4 is a flow chart illustrating yet another method of sound signal recognition according to an example embodiment.
Fig. 5 is a flow chart illustrating yet another sound signal identification method according to an example embodiment.
Fig. 6 is a block diagram illustrating a voice signal recognition apparatus according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a structure of a first noise reduction module 602 according to an exemplary embodiment.
Fig. 8 is a schematic structural diagram illustrating the separation matrix update submodule 705 according to an exemplary embodiment.
Fig. 9 is a schematic diagram illustrating a structure of the positioning module 603 according to an exemplary embodiment.
FIG. 10 is a schematic diagram illustrating a structure of a second noise reduction module 604 according to an exemplary embodiment.
Fig. 11 is a block diagram illustrating an apparatus (a general structure of a mobile terminal) according to an example embodiment.
Fig. 12 is a block diagram illustrating an apparatus (general structure of a server) according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
Generally, a sound source direction-finding algorithm directly uses the data collected by the microphones and performs direction estimation with algorithms such as Steered Response Power with Phase Transform (SRP-PHAT). However, such an algorithm depends on the signal-to-noise ratio of the signal: its accuracy is not high enough at a low signal-to-noise ratio, and the effective sound source cannot be accurately located when interfering sound sources are present in various directions.
In order to solve the above problem, embodiments of the present disclosure provide a sound signal identification method and apparatus. The collected data are subjected to noise reduction processing and then subjected to direction-finding positioning, the signal-to-noise ratio is further improved by performing noise reduction processing on the direction-finding positioning result once again, and then a final time domain sound source signal is obtained, so that the influence of an interference sound source is eliminated, the problem of low sound source positioning accuracy under the scene of strong interference and low signal-to-noise ratio is solved, and the sound source positioning with high efficiency and high interference resistance is realized.
An exemplary embodiment of the present disclosure provides a sound signal identification method, a flow of completing sound source localization using the method is shown in fig. 1, and the method includes:
step 101, acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively.
In this embodiment, the collection point may be a microphone. For example, there may be multiple microphones disposed on the same device, the multiple microphones making up a microphone array.
In this step, data acquisition is performed at each acquisition point, and the acquired data may be from multiple sound sources. The plurality of sound sources may include a target effective sound source and may also include an interfering sound source.
The acquisition point acquires original observation data of at least two sound sources.
And 102, performing primary noise reduction processing on the original observation data to obtain observation signal estimation data.
In this step, the acquired original observation data is subjected to a first-stage noise reduction process to eliminate noise influence generated by an interference sound source and the like.
After preprocessing, the original observation data may be signal-separated and, based on the minimal distortion principle (MDP), the estimate of the observation data of each sound source at each acquisition point may be recovered.
And after the noise reduction processing, obtaining the estimation data of the observation signal subjected to noise reduction.
And 103, obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data.
In this step, after the observation signal estimation data which is free of noise influence and is closer to the real sound source data is obtained, the observation signal data of each sound source at each acquisition point can be obtained accordingly.
The sound source is further localized to obtain localization information of each sound source, and the localization information may include the direction of the sound source, for example, a three-dimensional coordinate value in a three-dimensional coordinate system. The direction of the sound source can be estimated according to the observation signal estimation data of each sound source through an SRP-PHAT algorithm, and the positioning of each sound source is completed.
And step 104, performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal.
In this step, to deal with the residual noise interference in the observation signal data obtained in step 103 and further improve the signal quality, a delay-and-sum beamforming technique is used to perform second-stage noise reduction processing. This enhances the sound source signal and suppresses signals from other directions (signals that may interfere with the sound source signal), thereby further improving the signal-to-noise ratio of the sound source signal; further sound source localization and recognition can then be performed on this basis to obtain a more accurate result.
And step 105, obtaining a time domain sound source signal with an enhanced signal-to-noise ratio for each sound source according to the beam enhancement output signal.
In the step, according to the beam enhancement output signal, a time domain sound source signal with enhanced signal-to-noise ratio after the beam separation processing is obtained through short-time inverse Fourier transform (ISTFT) and overlap addition, and compared with observation signal data, the time domain sound source signal has smaller noise, can truly and accurately reflect the sound signal emitted by the sound source, and realizes accurate and efficient sound signal identification.
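To make the overall flow of steps 101 to 105 concrete, the following minimal Python sketch shows how the two noise reduction stages chain together. It is illustrative only and not the claimed implementation; the stage functions passed in (blind_source_separation, localize_sources, delay_and_sum, istft_overlap_add) are hypothetical placeholders for the processing detailed in the later embodiments.

    import numpy as np

    def recognize_sound_sources(mic_stft, fs, mic_positions,
                                blind_source_separation, localize_sources,
                                delay_and_sum, istft_overlap_add):
        # Two-stage sketch of steps 101-105 (placeholder stage functions, illustrative only).
        # mic_stft: complex STFT observations, shape (M_mics, K_bins, T_frames).
        # Step 102: first-stage noise reduction via blind source separation
        obs_estimates = blind_source_separation(mic_stft)
        # Step 103: localize each source from its denoised per-microphone observations
        directions = localize_sources(obs_estimates, mic_positions, fs)
        # Step 104: second-stage noise reduction via delay-and-sum beamforming
        beam_outputs = delay_and_sum(obs_estimates, directions, mic_positions, fs)
        # Step 105: inverse STFT and overlap-add back to time-domain signals with improved SNR
        return [istft_overlap_add(y) for y in beam_outputs]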
An exemplary embodiment of the present disclosure further provides a sound signal identification method, which performs noise reduction processing on original observation data based on blind source separation to obtain observation signal estimation data, where a specific flow is shown in fig. 2, and includes:
step 201, initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point.
In this step, the number of rows and columns of the separation matrix is the number of sound sources, and the weighted covariance matrix is initialized as a zero matrix.
In this embodiment, a scene with two microphones as acquisition points is taken as an example. As shown in fig. 3, smart speaker a has two microphones: mic1 and mic 2; there are two sound sources in the space around smart speaker a: s1 and s 2. The signals from both sources can be picked up by both microphones. The signals of the two sound sources are mixed together in each microphone. The following coordinate system is established:
Let the microphone coordinates of smart speaker A be

p_i^mic = (x_i^mic, y_i^mic, z_i^mic),  i = 1, ..., M

where x, y and z are the three axes of the three-dimensional coordinate system, and x_i^mic, y_i^mic and z_i^mic are the x-, y- and z-axis coordinate values of the i-th microphone. In this example, M = 2.

Let x_i^m(τ) denote the time domain signal of the i-th microphone in frame τ, i = 1, 2; m = 1, ..., Nfft, where Nfft is the frame length of each sub-frame in the sound system of smart speaker A. After windowing each frame of Nfft samples, the corresponding frequency domain signal X_i(k,τ) is obtained by the Fourier transform (FFT).
For the convolutional blind separation problem, the frequency domain model is:
X(k,τ) = H(k,τ) s(k,τ)

Y(k,τ) = W(k,τ) X(k,τ)

wherein X(k,τ) = [X_1(k,τ), X_2(k,τ), ..., X_M(k,τ)]^T is the microphone observation data vector, s(k,τ) = [s_1(k,τ), s_2(k,τ), ..., s_M(k,τ)]^T is the sound source signal vector, and Y(k,τ) = [Y_1(k,τ), Y_2(k,τ), ..., Y_M(k,τ)]^T is the separated signal vector. H(k,τ) is an M×M mixing matrix, W(k,τ) is an M×M separation matrix, k is the frequency bin index, τ is the frame index, and (·)^T denotes the vector (or matrix) transpose. s_i(k,τ) is the frequency domain data of sound source i.

The separation matrix is represented as:

W(k,τ) = [w_1(k,τ), w_2(k,τ), ..., w_N(k,τ)]^H

wherein (·)^H denotes the conjugate transpose of a vector (or matrix).
In particular to the scenario shown in fig. 3:
Define the mixing matrix as:

H(k,τ) = [ h_11  h_12 ; h_21  h_22 ]

wherein h_ij is the transfer function of sound source i to mic j.

Define the separation matrix as:

W(k,τ) = [ w_11(k,τ)  w_12(k,τ) ; w_21(k,τ)  w_22(k,τ) ]

Let the frame length of each frame in the sound system be Nfft, and K = Nfft/2 + 1.

In this step, the separation matrix of each frequency point is initialized according to expression (1):

W(k) = [ 1  0 ; 0  1 ],  k = 1, ..., K    (1)

i.e., the separation matrix is the identity matrix, where k denotes the k-th frequency point.

The weighted covariance matrix V_i(k,τ) of each sound source at each frequency point is initialized as the zero matrix according to expression (2):

V_i(k,τ) = [ 0  0 ; 0  0 ],  k = 1, ..., K; i = 1, 2    (2)
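As an illustration of expressions (1) and (2), the initialization for M = 2 sound sources and K frequency points might look like the following numpy sketch; the function name and array layout are assumptions of this illustration, not part of the disclosure.

    import numpy as np

    def init_separation_state(num_sources: int, num_bins: int):
        # Expression (1): W(k) starts as the identity matrix for every frequency point.
        W = np.tile(np.eye(num_sources, dtype=complex), (num_bins, 1, 1))        # shape (K, M, M)
        # Expression (2): V_i(k) starts as the zero matrix for every source and frequency point.
        V = np.zeros((num_sources, num_bins, num_sources, num_sources), dtype=complex)  # (M, K, M, M)
        return W, V

    W, V = init_separation_state(num_sources=2, num_bins=257)   # K = Nfft/2 + 1 for Nfft = 512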
Step 202, obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals.
In this step, let x_i^m(τ) denote the time domain signal of the i-th microphone in frame τ, i = 1, 2; m = 1, ..., Nfft. According to expression (3), the frame is windowed and an Nfft-point FFT is performed to obtain the corresponding frequency domain signal X_i(k,τ):

X_i(k,τ) = FFT( win(m) · x_i^m(τ) ),  k = 1, ..., K    (3)

where win(m) is the analysis window. The observed signal matrix is then:

X(k,τ) = [X_1(k,τ), X_2(k,τ)]^T

wherein k = 1, ..., K.
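A minimal sketch of expression (3) follows, assuming a Hann analysis window (the disclosure does not fix the window, so that choice is an assumption of this illustration):

    import numpy as np

    def frame_to_frequency(frame_td: np.ndarray) -> np.ndarray:
        # Windowed Nfft-point FFT of one time-domain frame, keeping K = Nfft/2 + 1 bins.
        nfft = frame_td.shape[-1]
        window = np.hanning(nfft)                  # assumed analysis window
        return np.fft.rfft(frame_td * window)      # X_i(k, tau), k = 1, ..., K

    def observation_matrix(frame_mic1: np.ndarray, frame_mic2: np.ndarray) -> np.ndarray:
        # Observation matrix X(k, tau) with the two microphone spectra stacked as rows, shape (2, K).
        return np.stack([frame_to_frequency(frame_mic1), frame_to_frequency(frame_mic2)])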
And 203, solving prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix.
In this step, the observed signal matrix is first separated according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame. For the application scenario shown in fig. 3, W(k) of the previous frame is used to obtain the prior frequency domain estimates Y(k,τ) of the two sound source signals in the current frame.

For example, let Y(k,τ) = [Y_1(k,τ), Y_2(k,τ)]^T, k = 1, ..., K, wherein Y_1(k,τ) and Y_2(k,τ) are the estimates of sound sources s1 and s2 at time-frequency bin (k,τ), respectively. The observation matrix X(k,τ) is separated using the separation matrix W(k,τ) according to expression (4):

Y(k,τ) = W(k,τ) X(k,τ),  k = 1, ..., K    (4)

Then, according to expression (5), the frequency domain estimate of the i-th sound source at the τ-th frame is:

Y_i(τ) = [Y_i(1,τ), Y_i(2,τ), ..., Y_i(K,τ)]^T    (5)

wherein i = 1, 2.
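In code, the prior estimate of expression (4) is simply a per-bin matrix-vector product with the previous frame's separation matrices. The sketch below reuses the array shapes assumed in the initialization sketch above.

    import numpy as np

    def prior_frequency_estimate(W_prev: np.ndarray, X: np.ndarray) -> np.ndarray:
        # W_prev: (K, M, M) separation matrices of the previous frame
        # X:      (M, K)    observation matrix of the current frame
        # returns (M, K)    prior estimates Y_i(k, tau) = [W(k, tau-1) X(k, tau)]_i
        return np.einsum('kij,jk->ik', W_prev, X)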
And 204, updating the weighted covariance matrix according to the prior frequency domain estimation.
In this step, the weighted covariance matrix is updated according to the observed signal matrix and the conjugate transpose matrix of the observed signal matrix.
For the application scenario shown in FIG. 3, the weighted covariance matrix V is updatedi(k,τ)。
For example, the update of the weighted covariance matrix is performed according to expression (6):

V_i(k,τ) = β V_i(k,τ-1) + (1-β) φ_i(τ) X(k,τ) X^H(k,τ)    (6)

wherein β is the smoothing coefficient.

The contrast function is defined as:

G_R(r_i(τ)) = r_i(τ),  with  r_i(τ) = sqrt( Σ_{k=1}^{K} |Y_i(k,τ)|^2 )

and the weighting coefficient is defined as:

φ_i(τ) = G'_R(r_i(τ)) / r_i(τ) = 1 / r_i(τ)
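A sketch of the update in expression (6), with the full-band weighting coefficient defined above; the value of the smoothing coefficient beta is an assumption of this illustration.

    import numpy as np

    def update_weighted_covariance(V_prev: np.ndarray, X: np.ndarray, Y: np.ndarray,
                                   beta: float = 0.98) -> np.ndarray:
        # V_prev: (M, K, M, M), X: (M, K) observations, Y: (M, K) prior estimates.
        M, K = Y.shape
        r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=1)) + 1e-12   # r_i(tau), full-band magnitude
        phi = 1.0 / r                                         # weighting coefficient per source
        XXh = np.einsum('ik,jk->kij', X, X.conj())            # X(k) X(k)^H for every bin, (K, M, M)
        V_new = np.empty_like(V_prev)
        for i in range(M):
            V_new[i] = beta * V_prev[i] + (1.0 - beta) * phi[i] * XXh
        return V_new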
and step 205, updating the separation matrix according to the updated weighted covariance matrix.
In this step, the separation matrix of each sound source is updated according to the weighted covariance matrix of each sound source, and then the separation matrix is updated to be the conjugate transpose matrix after the separation matrices of each sound source are merged. For the application scenario shown in fig. 3, the separation matrix W (k, τ) is updated.
For example, the separation matrix W (k, τ) is updated according to expressions (7), (8), (9):
w_i(k,τ) = ( W(k,τ-1) V_i(k,τ) )^{-1} e_i    (7)

w_i(k,τ) = w_i(k,τ) / sqrt( w_i^H(k,τ) V_i(k,τ) w_i(k,τ) )    (8)

W(k,τ) = [w_1(k,τ), w_2(k,τ)]^H    (9)

wherein i = 1, 2 and e_i is the i-th column of the identity matrix.
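Expressions (7) to (9) as a per-bin sketch for the two-source case; the small regularization added to the matrix inverse is for numerical safety only and is not part of the original expressions.

    import numpy as np

    def update_separation_matrix(W_prev: np.ndarray, V: np.ndarray) -> np.ndarray:
        # W_prev: (K, M, M) previous separation matrices, V: (M, K, M, M) weighted covariances.
        K, M, _ = W_prev.shape
        W_new = np.empty_like(W_prev)
        eye = np.eye(M, dtype=complex)
        for k in range(K):
            rows = []
            for i in range(M):
                A = W_prev[k] @ V[i, k] + 1e-9 * eye              # regularized (assumption)
                w_i = np.linalg.solve(A, eye[:, i])               # expression (7): (W V_i)^-1 e_i
                denom = np.sqrt(np.real(w_i.conj() @ V[i, k] @ w_i)) + 1e-12
                rows.append(w_i / denom)                          # expression (8): normalization
            W_new[k] = np.stack(rows, axis=1).conj().T            # expression (9): W = [w_1, w_2]^H
        return W_new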
step 206, deblurring the updated separation matrix.
In this step, the separation matrix may be amplitude deblurred by MDP. For the application scenario shown in fig. 3, the MDP algorithm is used to deblur the amplitude of W (k, τ).
For example, the MDP amplitude deblurring process is performed according to expression (10):
W(k,τ)=diag(invW(k,τ))·W(k,τ) (10)
where invW(k,τ) is the inverse matrix of W(k,τ), and diag(invW(k,τ)) means setting the off-diagonal elements of invW(k,τ) to 0 and keeping only its main diagonal.
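Expression (10) as a short sketch; diag(invW) keeps only the main diagonal of the inverse, as described above.

    import numpy as np

    def mdp_deblur(W: np.ndarray) -> np.ndarray:
        # Minimal-distortion scaling applied per frequency bin: W <- diag(inv(W)) @ W.
        W_out = np.empty_like(W)
        for k in range(W.shape[0]):
            invW = np.linalg.inv(W[k])
            W_out[k] = np.diag(np.diag(invW)) @ W[k]   # zero the off-diagonal entries of inv(W)
        return W_out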
And step 207, separating the original observation data according to the deblurred separation matrix, and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
In this step, for the application scenario shown in fig. 3, the original microphone signal is separated by using W (k, τ) after amplitude deblurring to obtain posterior frequency domain estimation data Y (k, τ) of the sound source signal, which is specifically expressed as expression (11):
Y(k,τ)=[Y1(k,τ),Y2(k,τ)]T=W(k,τ)X(k,τ) (11)
after the posterior frequency domain estimation data with low signal-to-noise ratio is obtained, the posterior frequency domain estimation data is used as observation signal estimation data, the observation signal estimation data of each sound source at each acquisition point is further determined, and a high-quality data base is provided for the direction finding of each sound source.
An exemplary embodiment of the present disclosure further provides a sound signal identification method, where a flow of obtaining positioning information and observed signal data of each sound source according to the observed signal estimation data by using the method is shown in fig. 4, and the method includes:
step 401, obtaining observation signal data of each sound source at each collection point according to the observation signal estimation data.
In this step, the observation signal data of each sound source at each collection point is acquired based on the observation signal estimation data. For the application scenario shown in fig. 3, the superposition of each sound source at each microphone is estimated in this step to obtain an observed signal, and then the observed signal data of each sound source at each microphone is estimated.
For example, the original observation data is separated using W(k,τ) after MDP deblurring to obtain Y(k,τ) = [Y_1(k,τ), Y_2(k,τ)]^T. According to the principle of the MDP algorithm, the recovered Y(k,τ) is exactly an estimate of the observed signal of each sound source at the corresponding microphone, i.e.:

the estimate of the observed signal data of sound source s1 at mic1 is, as in expression (12):

Y_1(k,τ) = h_11 s_1(k,τ),  denoted as  Y_11(k,τ) = Y_1(k,τ)    (12)

the estimate of the observed signal data of sound source s2 at mic2 is, as in expression (13):

Y_2(k,τ) = h_22 s_2(k,τ),  denoted as  Y_22(k,τ) = Y_2(k,τ)    (13)

Since the observed signal at each microphone is a superposition of the observed signal data of the two sound sources, the estimate of the observed data of sound source s2 at mic1 is, as in expression (14):

Y_12(k,τ) = X_1(k,τ) - Y_11(k,τ)    (14)

and the estimate of the observed data of sound source s1 at mic2 is, as in expression (15):

Y_21(k,τ) = X_2(k,τ) - Y_22(k,τ)    (15)
Thus, based on the MDP algorithm, the observation signal data of each sound source at each microphone is completely recovered, and the original phase information is reserved. Therefore, the azimuth of each sound source can be further estimated based on these observation signal data.
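For the two-microphone scenario of fig. 3, the recovery in expressions (12) to (15) amounts to one subtraction per time-frequency bin. The sketch below follows the Y_ij notation used above and is illustrative only.

    import numpy as np

    def recover_per_mic_observations(Y: np.ndarray, X: np.ndarray):
        # Y: (2, K) separated posterior estimates, X: (2, K) raw microphone observations.
        Y11 = Y[0]            # expression (12): source s1 as observed at mic1
        Y22 = Y[1]            # expression (13): source s2 as observed at mic2
        Y12 = X[0] - Y11      # expression (14): source s2 as observed at mic1
        Y21 = X[1] - Y22      # expression (15): source s1 as observed at mic2
        return Y11, Y21, Y22, Y12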
Step 402, respectively estimating the orientation of each sound source according to the observation signal data of each sound source at each collection point, and obtaining the positioning information of each sound source.
In this step, the following estimation is performed on each sound source to obtain the azimuth of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
For the application scenario shown in fig. 3, the directions of the sound sources are estimated using the SRP-PHAT algorithm using the observed signal data of the sound sources at the microphones, respectively.
The principle of the SRP-PHAT algorithm is as follows:

Traverse the microphone pairs and compute the phase-transform-weighted cross-correlation for each pair:

For i = 1 : M-1
  For j = i+1 : M
    R_ij(τ) = IFFT( ( X_i(τ) .* conj(X_j(τ)) ) ./ | X_i(τ) .* conj(X_j(τ)) | )
  End
End

wherein X_i(τ) = [X_i(1,τ), ..., X_i(K,τ)]^T is the frequency domain data of the τ-th frame of the i-th microphone, and K equals Nfft; X_j(τ) is defined in the same way. ".*" denotes element-wise multiplication of two vectors.
Let the coordinates of any point s on the unit sphere be (s_x, s_y, s_z), satisfying s_x^2 + s_y^2 + s_z^2 = 1.

Calculate the time delay difference between the arbitrary point s and any two microphones:

For i = 1 : M-1
  For j = i+1 : M
    τ_ij(s) = fs · ( || s - p_i^mic || - || s - p_j^mic || ) / c
  End
End
wherein fs is the system sampling rate and c is the speed of sound.
According to R_ij and τ_ij(s), the corresponding Steered Response Power (SRP) is found:

SRP(s) = Σ_{i=1}^{M-1} Σ_{j=i+1}^{M} R_ij( τ_ij(s) )

All points s on the unit sphere are traversed, and the point with the maximum SRP value is the estimated sound source direction:

ŝ = argmax_s SRP(s)
Following the example of the scenario of fig. 3, in this step, Y_11(k,τ) and Y_21(k,τ) can be used instead of X(k,τ) = [X_1(k,τ), X_2(k,τ)]^T as the input to the SRP-PHAT algorithm to estimate the direction of sound source s1; likewise, Y_22(k,τ) and Y_12(k,τ) are used instead of X(k,τ) = [X_1(k,τ), X_2(k,τ)]^T to estimate the direction of sound source s2.

Because the signal-to-noise ratios of the separated Y_11(k,τ) and Y_21(k,τ), and of Y_22(k,τ) and Y_12(k,τ), have been greatly improved, the direction estimation is more stable and accurate.
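A compact sketch of the SRP-PHAT search described above for a single microphone pair and a precomputed grid of candidate points on the unit sphere; the grid, the rounding of delays to integer lags, and the use of an inverse FFT for the cross-correlation are assumptions of this illustration.

    import numpy as np

    def srp_phat_direction(X1: np.ndarray, X2: np.ndarray, mic_pos: np.ndarray,
                           grid: np.ndarray, fs: float, c: float = 343.0) -> np.ndarray:
        # X1, X2: full-band spectra (length Nfft) of one frame for the two channels.
        # mic_pos: (2, 3) microphone coordinates, grid: (P, 3) unit-sphere candidate points.
        cross = X1 * np.conj(X2)
        cross /= np.abs(cross) + 1e-12                 # phase transform (PHAT) weighting
        R = np.fft.ifft(cross).real                    # cross-correlation over integer lags
        d1 = np.linalg.norm(grid - mic_pos[0], axis=1)
        d2 = np.linalg.norm(grid - mic_pos[1], axis=1)
        lags = np.rint(fs * (d1 - d2) / c).astype(int) % len(X1)
        srp = R[lags]                                  # steered response power per candidate point
        return grid[np.argmax(srp)]                    # direction with the maximum SRP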
An exemplary embodiment of the present disclosure further provides a sound signal identification method, in which a flow of performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhanced output signal is shown in fig. 5, and the method includes:
step 501, respectively calculating propagation delay difference values of the sound sources according to the azimuth coordinates of the sound sources and the azimuth coordinates of the collection points.
In this embodiment, the propagation delay difference is a time difference between transmission of sound emitted by a sound source to each collection point.
In this step, the positioning information of the sound source includes the azimuth coordinates of the sound source. Still taking the application scenario shown in fig. 3 as an example, in the three-dimensional coordinate system the direction of sound source s1 is (x_s1, y_s1, z_s1) and the direction of sound source s2 is (x_s2, y_s2, z_s2). The acquisition points may be microphones, and the positions of the two microphones are (x_1^mic, y_1^mic, z_1^mic) and (x_2^mic, y_2^mic, z_2^mic), respectively.

First, the delay difference τ_1 from sound source s1 to the two microphones is calculated according to expressions (16) and (17):

d_1i = sqrt( (x_s1 - x_i^mic)^2 + (y_s1 - y_i^mic)^2 + (z_s1 - z_i^mic)^2 ),  i = 1, 2    (16)

τ_1 = fs · ( d_11 - d_12 ) / c    (17)

Similarly, the delay difference τ_2 from sound source s2 to the two microphones is calculated according to expressions (18) and (19):

d_2i = sqrt( (x_s2 - x_i^mic)^2 + (y_s2 - y_i^mic)^2 + (z_s2 - z_i^mic)^2 ),  i = 1, 2    (18)

τ_2 = fs · ( d_21 - d_22 ) / c    (19)
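Expressions (16) to (19) reduce to the following short helper; the result is the sample-domain delay difference of a source's sound arriving at the two microphones (a sketch, with the speed of sound as an assumed default).

    import numpy as np

    def propagation_delay_difference(source_pos, mic1_pos, mic2_pos, fs, c=343.0):
        # Delay difference (in samples) between the source's arrival at mic1 and at mic2.
        d1 = np.linalg.norm(np.asarray(source_pos) - np.asarray(mic1_pos))
        d2 = np.linalg.norm(np.asarray(source_pos) - np.asarray(mic2_pos))
        return fs * (d1 - d2) / c

    # e.g. tau1 = propagation_delay_difference(s1_direction, mic1_xyz, mic2_xyz, fs=16000)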
Step 502, using the observation signal data of each sound source, performing a second-level noise reduction on each sound source through beam delay and sum beam forming processing, and obtaining the beam enhanced output signal of each sound source.
In this step, taking the scenario shown in fig. 3 as an example, the second-stage noise reduction is performed on each sound source through beam delay and sum beamforming processing according to expressions (20) and (21):
using Y11(k, τ) and Y21(k, τ) to the sound source s1Performing beam delay summation beam forming to obtain sound source s1Is of the enhanced output signal YE1(k,τ):
Figure BDA0003056817300000151
Using Y12(k, τ) and Y22(k, τ) to the sound source s2Performing beam delay summation beam forming to obtain sound source s2Is of the enhanced output signal YE2(k,τ):
Figure BDA0003056817300000152
Where K is 1, …, K.
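The combination in expressions (20) and (21) is ordinary frequency-domain delay-and-sum: each channel is phase-shifted to compensate its own propagation delay and the aligned channels are averaged. The sketch below uses absolute per-channel delays rather than the pairwise difference, which differs from the reconstructed expressions only by a common phase factor; it is an illustration, not the claimed formula.

    import numpy as np

    def delay_and_sum(Y_channels: np.ndarray, delays_samples: np.ndarray, nfft: int) -> np.ndarray:
        # Y_channels: (M, K) per-source observation estimates at each microphone, K = nfft // 2 + 1.
        # delays_samples: (M,) propagation delay of the source to each microphone, in samples.
        k = np.arange(Y_channels.shape[1])
        out = np.zeros(Y_channels.shape[1], dtype=complex)
        for Y, d in zip(Y_channels, delays_samples):
            out += Y * np.exp(1j * 2 * np.pi * k * d / nfft)   # advance each channel by its delay
        return out / len(Y_channels)

    # e.g. YE1 = delay_and_sum(np.stack([Y11, Y21]), np.array([fs * d11 / c, fs * d12 / c]), nfft)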
An exemplary embodiment of the present disclosure also provides a sound signal identification method, which can obtain a time-domain sound source signal with an enhanced signal-to-noise ratio of each sound source according to a beam enhanced output signal. The wave beam enhanced output signals of each sound source can be subjected to short-time Fourier inverse transformation and then overlapped and added to obtain time-domain sound source signals with enhanced signal-to-noise ratios of each sound source.
Still taking the application scenario shown in fig. 3 as an example, according to expression (22), the beam enhanced output signals YE_1(τ) = [YE_1(1,τ), ..., YE_1(K,τ)] and YE_2(τ) = [YE_2(1,τ), ..., YE_2(K,τ)], k = 1, ..., K, are subjected to the inverse short-time Fourier transform (ISTFT) and overlap-add to obtain the time domain sound source signals with enhanced signal-to-noise ratio after beam separation processing, denoted as:

ŝ_i^m(τ) = ISTFT( YE_i(τ) ),  m = 1, ..., Nfft; i = 1, 2    (22)
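Expression (22) as a sketch, assuming 50% frame overlap and a Hann synthesis window matching the analysis sketch earlier (both are assumptions of this illustration, not fixed by the disclosure).

    import numpy as np

    def istft_overlap_add(frames_freq: np.ndarray, nfft: int, hop: int) -> np.ndarray:
        # frames_freq: (T, K) beam enhanced spectra YE_i(., tau) with K = nfft // 2 + 1.
        num_frames = frames_freq.shape[0]
        window = np.hanning(nfft)
        out = np.zeros(hop * (num_frames - 1) + nfft)
        for t, spec in enumerate(frames_freq):
            frame_td = np.fft.irfft(spec, n=nfft) * window   # back to the time domain
            out[t * hop : t * hop + nfft] += frame_td        # overlap-add
        return out

    # e.g. s1_hat = istft_overlap_add(YE1_frames, nfft=512, hop=256)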
Because the observation data of the microphones is noisy and the direction-finding algorithm depends heavily on the signal-to-noise ratio, direction finding is inaccurate when the signal-to-noise ratio is low, which affects the accuracy of the voice recognition result. In the embodiment of the disclosure, after blind source separation, the delay-and-sum beamforming technique is used to further remove the noise influence from the observation signal data and improve the signal-to-noise ratio, thereby avoiding the less accurate voice recognition result that is obtained when the sound source direction is estimated directly from the original microphone observation data X(k,τ) = [X_1(k,τ), X_2(k,τ)]^T.
An exemplary embodiment of the present disclosure also provides a sound signal recognition apparatus, the structure of which is shown in fig. 6, including:
the original data acquisition module 601 is configured to acquire original observation data acquired by at least two acquisition points for at least two sound sources respectively;
a first denoising module 602, configured to perform a first-stage denoising process on the original observation data to obtain observation signal estimation data;
a positioning module 603, configured to obtain positioning information and observation signal data of each sound source according to the observation signal estimation data;
a second noise reduction module 604, configured to perform second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and an enhanced signal output module 605, configured to obtain, according to the beam enhancement output signal, a time-domain sound source signal with an enhanced signal-to-noise ratio for each sound source.
The first noise reduction module 602 is shown in fig. 7, and includes:
the matrix initialization submodule 701 is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the frequency domain data acquisition submodule 702 is configured to obtain a time domain signal at each acquisition point, and construct an observation signal matrix according to a frequency domain signal corresponding to the time domain signal;
a priori frequency domain estimation submodule 703, configured to obtain a priori frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observed signal matrix;
a covariance matrix update sub-module 704 for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule 705, configured to update the separation matrix according to the updated weighted covariance matrix;
a deblurring submodule 706 configured to deblur the updated separation matrix;
and the posterior domain estimation submodule 707 is configured to separate the original observation data according to the deblurred separation matrix, and use the posterior domain estimation data obtained through separation as the observation signal estimation data.
Further, the priori frequency domain estimation sub-module 703 is configured to separate the observed signal matrix according to the separation matrix of the previous frame, so as to obtain a priori frequency domain estimation of each sound source of the current frame.
Further, the covariance matrix update sub-module 704 is configured to update the weighted covariance matrix according to the observed signal matrix and a conjugate transpose matrix of the observed signal matrix.
Further, the structure of the separation matrix update submodule 705 is shown in fig. 8, and includes:
a first updating submodule 801, configured to update the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and a second updating sub-module 802, configured to update the separation matrix to be a conjugate transpose matrix obtained by combining the separation matrices of the sound sources.
Further, the deblurring sub-module 706 is configured to perform amplitude deblurring on the separation matrix according to a minimum distortion criterion. The deblurring process can be performed using MDP.
The structure of the positioning module 603 is shown in fig. 9, and includes:
a sound source data estimation submodule 901, configured to obtain observed signal data of each sound source at each acquisition point according to the observed signal estimation data;
the first positioning sub-module 902 is configured to estimate the direction of each sound source according to the observation signal data of each sound source at each collection point, so as to obtain positioning information of each sound source.
Further, the first positioning sub-module 902 is configured to estimate each sound source as follows, and obtain the position of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
The positioning information of the sound source includes the azimuth coordinate of the sound source, and the structure of the second noise reduction module 604 is shown in fig. 10, and includes:
a time delay calculation submodule 1001, configured to calculate propagation time delay difference values of each sound source according to the azimuth coordinates of each sound source and the azimuth coordinates of each acquisition point, where the propagation time delay difference values are time difference values of sound emitted by the sound source and transmitted to each acquisition point;
a beam summation submodule 1002, configured to perform second-stage noise reduction on each sound source through delay-and-sum beamforming processing by using the observation signal data of each sound source, to obtain the beam enhanced output signal of each sound source.
Further, the enhanced signal output module 605 is configured to perform an inverse short-time Fourier transform on the beam enhanced output signals of each sound source and then overlap-add them to obtain a time-domain sound source signal with an enhanced signal-to-noise ratio for each sound source.
The apparatus may be integrated in an intelligent terminal device or in a remote computing platform; alternatively, some of its functional modules may be integrated in the intelligent terminal device and others in the remote computing platform, with the corresponding functions realized by the intelligent terminal device and/or the remote computing platform.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 11 is a block diagram illustrating an apparatus 1100 for sound source localization in accordance with an exemplary embodiment. For example, the apparatus 1100 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 11, apparatus 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power component 1106, a multimedia component 1108, an audio component 1110, an input/output (I/O) interface 1112, a sensor component 1114, and a communication component 1116.
The processing component 1102 generally controls the overall operation of the device 1100, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 1102 may include one or more processors 1120 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 1102 may include one or more modules that facilitate interaction between the processing component 1102 and other components. For example, the processing component 1102 may include a multimedia module to facilitate interaction between the multimedia component 1108 and the processing component 1102.
The memory 1104 is configured to store various types of data to support operation at the device 1100. Examples of such data include instructions for any application or method operating on device 1100, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1104 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 1106 provide power to the various components of device 1100. The power components 1106 can include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the apparatus 1100.
The multimedia component 1108 includes a screen that provides an output interface between the device 1100 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1108 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 1100 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1110 is configured to output and/or input audio signals. For example, the audio component 1110 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 1100 is in operating modes, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1104 or transmitted via the communication component 1116. In some embodiments, the audio assembly 1110 further includes a speaker for outputting audio signals.
The I/O interface 1112 provides an interface between the processing component 1102 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 1114 includes one or more sensors for providing various aspects of state assessment for the apparatus 1100. For example, the sensor assembly 1114 may detect the open/closed state of the device 1100 and the relative positioning of components, such as the display and keypad of the apparatus 1100. The sensor assembly 1114 may also detect a change in position of the apparatus 1100 or of a component of the apparatus 1100, the presence or absence of user contact with the apparatus 1100, the orientation or acceleration/deceleration of the apparatus 1100, and a change in temperature of the apparatus 1100. The sensor assembly 1114 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1114 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1116 is configured to facilitate wired or wireless communication between the apparatus 1100 and other devices. The apparatus 1100 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 1116 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1116 also includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 1100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium comprising instructions, such as the memory 1104 comprising instructions, that are executable by the processor 1120 of the apparatus 1100 to perform the method described above. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
Fig. 12 is a block diagram illustrating an apparatus 1200 for sound source localization according to an exemplary embodiment. For example, the apparatus 1200 may be provided as a server. Referring to fig. 12, the apparatus 1200 includes a processing component 1222 that further includes one or more processors, and memory resources, represented by memory 1232, for storing instructions, such as application programs, that are executable by the processing component 1222. The application programs stored in memory 1232 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1222 is configured to execute instructions to perform the above-described methods.
The apparatus 1200 may also include a power supply component 1226 configured to perform power management of the apparatus 1200, a wired or wireless network interface 1250 configured to connect the apparatus 1200 to a network, and an input/output (I/O) interface 1258. The apparatus 1200 may operate based on an operating system stored in the memory 1232, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The embodiments of the disclosure provide a sound signal identification method and apparatus. Original observation data acquired by at least two acquisition points for at least two sound sources is first obtained; first-stage noise reduction processing is performed on the original observation data to obtain observation signal estimation data; positioning information and observation signal data of each sound source are then obtained from the observation signal estimation data; second-stage noise reduction processing is performed on the observation signal data according to the positioning information to obtain a beam enhancement output signal; and a time domain sound source signal with enhanced signal-to-noise ratio is obtained for each sound source from the beam enhancement output signal. Because the original observation data is first denoised in order to locate each sound source, and the signal-to-noise ratio is then further improved by beam enhancement to highlight the signal, the method addresses the low sound source localization accuracy and poor speech recognition quality encountered in strong-interference, low signal-to-noise ratio scenes, and achieves efficient, interference-resistant sound signal recognition.
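For orientation only, the overall two-stage flow described above can be written as the hypothetical sketch below. Every function name in it (first_stage_denoise, localize, second_stage_beamform, recognize) is an illustrative stub standing in for the corresponding step, not the disclosed implementation; the stubs merely show how data flows between the stages.

```python
import numpy as np

# Trivial placeholder: the real first stage performs blind-separation-based noise reduction.
def first_stage_denoise(raw_obs):
    n_src = raw_obs.shape[0]                       # assume one source per acquisition point
    return np.tile(raw_obs[None, :, :], (n_src, 1, 1))

# Trivial placeholder: the real step runs a direction-finding algorithm per source.
def localize(source_obs, mic_positions, fs):
    return 0.0                                     # placeholder azimuth (degrees)

# Trivial placeholder: the real second stage is delay-and-sum style beam enhancement.
def second_stage_beamform(source_obs, doa, mic_positions, fs):
    return source_obs.mean(axis=0)                 # "beam" = channel average

def recognize(raw_obs, mic_positions, fs):
    estimates = first_stage_denoise(raw_obs)       # first-stage noise reduction
    enhanced = []
    for est in estimates:                          # per sound source
        doa = localize(est, mic_positions, fs)     # positioning information
        enhanced.append(second_stage_beamform(est, doa, mic_positions, fs))
    return np.stack(enhanced)                      # SNR-enhanced time-domain signals

if __name__ == "__main__":
    fs = 16000
    mics = np.array([[0.0, 0.0], [0.05, 0.0]])     # two acquisition points, 5 cm apart
    raw = np.random.randn(2, fs)                   # 1 s of raw observation data
    print(recognize(raw, mics, fs).shape)          # (2, 16000): one output per source
```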
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (22)

1. A method for recognizing a sound signal, comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
2. The method according to claim 1, wherein the step of performing the first-stage noise reduction processing on the original observation data to obtain the observation signal estimation data comprises:
initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, wherein the number of rows and columns of the separation matrix is the number of the sound sources;
obtaining time domain signals at each acquisition point, and constructing an observation signal matrix according to frequency domain signals corresponding to the time domain signals;
obtaining a prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
updating the weighted covariance matrix according to the prior frequency domain estimate;
updating the separation matrix according to the updated weighted covariance matrix;
deblurring the updated separation matrix;
and separating the original observation data according to the deblurred separation matrix, and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
3. The method of claim 2, wherein the step of obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix comprises:
and separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
4. The method of claim 2, wherein the step of updating the weighted covariance matrix based on the a priori frequency domain estimates comprises:
and updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
5. The sound signal identification method of claim 2, wherein the step of updating the separation matrix according to the updated weighted covariance matrix comprises:
respectively updating the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and updating the separation matrix to be a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
6. The sound signal identification method of claim 2, wherein the step of deblurring the updated separation matrix comprises:
and carrying out amplitude deblurring processing on the separation matrix by adopting a minimum distortion criterion.
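Claims 2 to 6 together describe one frame of an online blind-separation update: initialize the separation and weighted covariance matrices, form the prior frequency-domain estimate with the previous frame's separation matrix, update the weighted covariance and the separation matrix, de-blur by the minimum distortion criterion, and separate. The following is a minimal auxiliary-function IVA style sketch of such a frame update, written under the assumptions that the number of acquisition points equals the number of sound sources and that a spherical source model supplies the weighting; it is an illustration of this family of updates, not necessarily the patented rule, and the helper name auxiva_online_frame is ours.

```python
import numpy as np

def auxiva_online_frame(X, W, V, alpha=0.96, eps=1e-6):
    """One online frame update (claims 2-6 style sketch).
    X : (F, M) observation signal matrix for the current frame (F bins, M mics = M sources)
    W : (F, M, M) separation matrices carried over from the previous frame
    V : (M, F, M, M) weighted covariance matrix per source and frequency bin
    """
    F, M = X.shape
    # Prior frequency-domain estimate of each source (claim 3): separate with previous W.
    Y = np.einsum('fij,fj->fi', W, X)                        # (F, M)
    # Per-source weight shared across frequencies (spherical contrast function).
    r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + eps        # (M,)
    # Rank-1 term built from the observation matrix and its conjugate transpose (claim 4).
    XXh = np.einsum('fi,fj->fij', X, X.conj())               # (F, M, M)
    for k in range(M):
        V[k] = alpha * V[k] + (1.0 - alpha) * (XXh / r[k])   # weighted covariance update
        for f in range(F):                                   # separation matrix update (claim 5)
            WV = W[f] @ V[k, f]
            w = np.linalg.solve(WV + eps * np.eye(M), np.eye(M)[:, k])
            w = w / np.sqrt(np.real(w.conj() @ V[k, f] @ w) + eps)
            W[f, k, :] = w.conj()                            # k-th row of the combined matrix
    # Amplitude de-blurring by the minimum distortion criterion (claim 6).
    for f in range(F):
        W[f] = np.diag(np.diag(np.linalg.inv(W[f]))) @ W[f]
    # Posterior estimate used as the observation signal estimation data (end of claim 2).
    return np.einsum('fij,fj->fi', W, X), W, V

# Initialization described in claim 2: identity separation matrix at every frequency
# bin, and a small identity weighted covariance per source (example sizes only).
F, M = 257, 2
W0 = np.tile(np.eye(M, dtype=complex), (F, 1, 1))
V0 = 1e-3 * np.tile(np.eye(M, dtype=complex), (M, F, 1, 1))
```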
7. The sound signal identification method according to claim 1, wherein the step of obtaining the localization information and the observed signal data of each sound source from the observed signal estimation data comprises:
obtaining the observation signal data of each sound source at each acquisition point according to the observation signal estimation data;
and respectively estimating the direction of each sound source according to the observation signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
8. The sound signal identification method according to claim 7, wherein the step of estimating the orientation of each sound source from the observed signal data of each sound source at each collection point, respectively, to obtain the localization information of each sound source comprises:
and respectively estimating each sound source as follows to obtain the azimuth of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
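Claim 8 leaves the direction-finding algorithm open. One widely used option that fits the described input (observation signal data of the same source at different acquisition points) is GCC-PHAT; the sketch below estimates a single source's azimuth from one microphone pair under a far-field assumption. It is only an example of such an algorithm, and the function name gcc_phat_doa is illustrative.

```python
import numpy as np

def gcc_phat_doa(sig_a, sig_b, fs, mic_distance, c=343.0):
    """Azimuth (degrees) of one source from one pair of acquisition points via GCC-PHAT."""
    n = sig_a.shape[0] + sig_b.shape[0]
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    R = A * np.conj(B)
    R /= np.abs(R) + 1e-12                            # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_lag = int(np.ceil(mic_distance / c * fs))     # physically possible lags only
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    tau = (np.argmax(np.abs(cc)) - max_lag) / fs      # time difference of arrival (s)
    # Far-field model: tau = d * cos(theta) / c.
    return np.degrees(np.arccos(np.clip(tau * c / mic_distance, -1.0, 1.0)))
```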
9. The method according to claim 7, wherein the positioning information of the sound source includes an azimuth coordinate of the sound source, and the step of performing the second-stage noise reduction processing on the observation signal data according to the positioning information to obtain the beam enhancement output signal includes:
respectively calculating a propagation delay difference value of each sound source according to the azimuth coordinate of each sound source and the azimuth coordinate of each acquisition point, wherein the propagation delay difference value is the difference in time taken for the sound emitted by the sound source to propagate to each acquisition point;
and respectively performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing by using the observation signal data of each sound source, to obtain the beam enhancement output signal of each sound source.
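Claim 9's two steps, computing the propagation delay differences from the source and acquisition-point coordinates and then delay-and-sum beamforming, can be sketched as follows. The frequency-domain fractional delay used here is one possible realisation under the assumption that the coordinates are known in a common frame; delay_and_sum is an assumed helper name, not the patented implementation.

```python
import numpy as np

def delay_and_sum(obs, mic_xyz, src_xyz, fs, c=343.0):
    """
    obs     : (M, N) observation signal data of one source at M acquisition points
    mic_xyz : (M, 3) acquisition-point coordinates
    src_xyz : (3,)   azimuth/position coordinate of the sound source
    Returns the beam enhancement output signal for this source.
    """
    dist = np.linalg.norm(mic_xyz - src_xyz, axis=1)       # source-to-mic distances
    delays = (dist - dist.min()) / c                       # propagation delay differences (s)
    M, N = obs.shape
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)
    OBS = np.fft.rfft(obs, axis=1)
    # Advance each channel by its delay difference (fractional delay in the frequency
    # domain), then average: the target source adds coherently, diffuse noise does not.
    aligned = OBS * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=N)
```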
10. The method of claim 9, wherein the step of obtaining the time domain sound source signal with enhanced signal-to-noise ratio of each sound source according to the beam enhancement output signal comprises:
and performing an inverse short-time Fourier transform on the beam enhancement output signal of each sound source and then overlap-adding the resulting frames to obtain the time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
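The inverse short-time Fourier transform followed by overlap-add in claim 10 is exactly what standard ISTFT routines perform. A sketch using scipy.signal.istft is given below; the choice of scipy and the parameter values are assumptions for illustration, not part of the claim.

```python
import numpy as np
from scipy.signal import istft

def beams_to_time_domain(beam_stft, fs, nperseg=512, noverlap=384):
    """
    beam_stft : (n_sources, F, T) beam enhancement output spectra, one STFT per source,
                with F = nperseg // 2 + 1 frequency bins.
    Returns the SNR-enhanced time-domain sound source signal of each source.
    scipy's istft inverts each frame and overlap-adds the frames, which is the
    operation described in claim 10.
    """
    out = []
    for spec in beam_stft:
        _, x = istft(spec, fs=fs, nperseg=nperseg, noverlap=noverlap)
        out.append(x)
    return np.stack(out)
```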
11. An apparatus for recognizing a sound signal, comprising:
the original data acquisition module is used for acquiring original observation data acquired by at least two acquisition points on at least two sound sources respectively;
the first noise reduction module is used for carrying out first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
the positioning module is used for obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
the second noise reduction module is used for carrying out second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and the enhanced signal output module is used for obtaining, according to the beam enhancement output signal, a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
12. The apparatus according to claim 11, wherein the first noise reduction module comprises:
the matrix initialization submodule is used for initializing a separation matrix of each frequency point and a weighted covariance matrix of each sound source at each frequency point, and the number of rows and the number of columns of the separation matrix are the number of the sound sources;
the frequency domain data acquisition submodule is used for obtaining time domain signals at each acquisition point and constructing an observation signal matrix according to the frequency domain signals corresponding to the time domain signals;
the prior frequency domain estimation submodule is used for obtaining the prior frequency domain estimation of each sound source of the current frame according to the separation matrix of the previous frame and the observation signal matrix;
a covariance matrix update submodule for updating the weighted covariance matrix according to the prior frequency domain estimate;
a separation matrix updating submodule for updating the separation matrix according to the updated weighted covariance matrix;
a deblurring submodule for deblurring the updated separation matrix;
and the posterior domain estimation submodule is used for separating the original observation data according to the deblurred separation matrix and taking the posterior domain estimation data obtained by separation as the observation signal estimation data.
13. The sound signal identification device of claim 12,
and the prior frequency domain estimation submodule is used for separating the observation signal matrix according to the separation matrix of the previous frame to obtain the prior frequency domain estimation of each sound source of the current frame.
14. The sound signal identification device of claim 12,
and the covariance matrix updating submodule is used for updating the weighted covariance matrix according to the observation signal matrix and the conjugate transpose matrix of the observation signal matrix.
15. The apparatus according to claim 12, wherein the separation matrix update submodule comprises:
the first updating submodule is used for respectively updating the separation matrix of each sound source according to the weighted covariance matrix of each sound source;
and the second updating submodule is used for updating the separation matrix into a conjugate transpose matrix after the separation matrixes of the sound sources are combined.
16. The sound signal identification device of claim 12,
and the deblurring submodule is used for carrying out amplitude deblurring processing on the separation matrix by adopting a minimum distortion criterion.
17. The apparatus according to claim 11, wherein the localization module comprises:
the sound source data estimation submodule is used for obtaining the observation signal data of each sound source at each acquisition point according to the observation signal estimation data;
and the first positioning submodule is used for respectively estimating the position of each sound source according to the observation signal data of each sound source at each acquisition point to obtain the positioning information of each sound source.
18. The sound signal identification device of claim 17,
the first positioning submodule is configured to estimate each sound source as follows to obtain the position of each sound source:
and forming the observation data of the acquisition points by using the observation signal data of the same sound source at different acquisition points, and positioning the sound source by a direction-finding algorithm to obtain the positioning information of each sound source.
19. The apparatus according to claim 17, wherein the positioning information of the sound source includes azimuth coordinates of the sound source, and the second noise reduction module comprises:
the time delay calculation submodule is used for respectively calculating a propagation delay difference value of each sound source according to the azimuth coordinate of each sound source and the azimuth coordinate of each acquisition point, wherein the propagation delay difference value is the difference in time taken for the sound emitted by the sound source to propagate to each acquisition point;
and the beam summation submodule is used for performing second-stage noise reduction on each sound source through delay-and-sum beamforming processing by using the observation signal data of each sound source, to obtain the beam enhancement output signal of each sound source.
20. The sound signal identification device of claim 19,
and the enhanced signal output module is used for performing an inverse short-time Fourier transform on the beam enhancement output signal of each sound source and then overlap-adding the resulting frames to obtain the time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
21. A computer device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
22. A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a sound signal recognition method, the method comprising:
acquiring original observation data acquired by at least two acquisition points for at least two sound sources respectively;
performing first-stage noise reduction processing on the original observation data to obtain observation signal estimation data;
obtaining positioning information and observation signal data of each sound source according to the observation signal estimation data;
performing second-stage noise reduction processing on the observation signal data according to the positioning information to obtain a beam enhancement output signal;
and according to the beam enhancement output signal, obtaining a time domain sound source signal with enhanced signal-to-noise ratio of each sound source.
CN202110502126.9A 2021-05-08 2021-05-08 Sound signal identification method and device Pending CN113053406A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110502126.9A CN113053406A (en) 2021-05-08 2021-05-08 Sound signal identification method and device

Publications (1)

Publication Number Publication Date
CN113053406A true CN113053406A (en) 2021-06-29

Family

ID=76518218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110502126.9A Pending CN113053406A (en) 2021-05-08 2021-05-08 Sound signal identification method and device

Country Status (1)

Country Link
CN (1) CN113053406A (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015196729A1 (en) * 2014-06-27 2015-12-30 中兴通讯股份有限公司 Microphone array speech enhancement method and device
CN105244036A (en) * 2014-06-27 2016-01-13 中兴通讯股份有限公司 Microphone speech enhancement method and microphone speech enhancement device
CN104616667A (en) * 2014-12-02 2015-05-13 清华大学 Active noise reduction method for automobile
US20190325888A1 (en) * 2018-04-20 2019-10-24 Baidu Online Network Technology (Beijing) Co., Ltd . Speech recognition method, device, apparatus and computer-readable storage medium
CN209880151U (en) * 2019-03-15 2019-12-31 厦门大学 Microphone array speech enhancement device
CN109920442A (en) * 2019-03-15 2019-06-21 厦门大学 A kind of method and system of Microphone Array Speech enhancing
CN112216298A (en) * 2019-07-12 2021-01-12 大众问问(北京)信息科技有限公司 Method, device and equipment for orienting sound source by double-microphone array
WO2021008000A1 (en) * 2019-07-12 2021-01-21 大象声科(深圳)科技有限公司 Voice wakeup method and apparatus, electronic device and storage medium
CN110364176A (en) * 2019-08-21 2019-10-22 百度在线网络技术(北京)有限公司 Audio signal processing method and device
CN111009257A (en) * 2019-12-17 2020-04-14 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111128221A (en) * 2019-12-17 2020-05-08 北京小米智能科技有限公司 Audio signal processing method and device, terminal and storage medium
CN111161751A (en) * 2019-12-25 2020-05-15 声耕智能科技(西安)研究院有限公司 Distributed microphone pickup system and method under complex scene
CN111044973A (en) * 2019-12-31 2020-04-21 山东大学 MVDR target sound source directional pickup method for microphone matrix
CN111179960A (en) * 2020-03-06 2020-05-19 北京松果电子有限公司 Audio signal processing method and device and storage medium
CN111402917A (en) * 2020-03-13 2020-07-10 北京松果电子有限公司 Audio signal processing method and device and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506582A (en) * 2021-05-25 2021-10-15 北京小米移动软件有限公司 Sound signal identification method, device and system
CN113782047A (en) * 2021-09-06 2021-12-10 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium
CN113782047B (en) * 2021-09-06 2024-03-08 云知声智能科技股份有限公司 Voice separation method, device, equipment and storage medium

Similar Documents

Publication Title
CN108510987B (en) Voice processing method and device
CN111128221B (en) Audio signal processing method and device, terminal and storage medium
CN110970046B (en) Audio data processing method and device, electronic equipment and storage medium
CN109977860B (en) Image processing method and device, electronic equipment and storage medium
CN111402917B (en) Audio signal processing method and device and storage medium
CN111009256A (en) Audio signal processing method and device, terminal and storage medium
CN111179960B (en) Audio signal processing method and device and storage medium
CN113053406A (en) Sound signal identification method and device
CN111741394A (en) Data processing method and device and readable medium
CN113506582A (en) Sound signal identification method, device and system
CN111009257A (en) Audio signal processing method and device, terminal and storage medium
CN111589138B (en) Action prediction method, device, equipment and storage medium
CN113539290A (en) Voice noise reduction method and device
CN112447184A (en) Voice signal processing method and device, electronic equipment and storage medium
CN110459236B (en) Noise estimation method, apparatus and storage medium for audio signal
CN113223553B (en) Method, apparatus and medium for separating voice signal
CN113223548B (en) Sound source positioning method and device
CN110517703B (en) Sound collection method, device and medium
CN111667842B (en) Audio signal processing method and device
CN113488066A (en) Audio signal processing method, audio signal processing apparatus, and storage medium
CN113314135B (en) Voice signal identification method and device
CN111863012A (en) Audio signal processing method and device, terminal and storage medium
CN113345461A (en) Voice processing method and device for voice processing
CN109543544B (en) Cross-spectrum image matching method and device, electronic equipment and storage medium
CN112434714A (en) Multimedia identification method, device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination